A long-running difficulty with conventional game theory has been how to
modify it to accommodate the bounded rationality of all real-world players.
A recurring issue in statistical physics is how best to approximate joint
probability distributions with decoupled (and therefore far more tractable)
distributions. This paper shows that the same information theoretic
mathematical structure, known as Product Distribution (PD) theory, addresses
both issues. In this, PD theory not only provides a principled formulation
of bounded rationality and a set of new types of mean field theory in
statistical physics; it also shows that those topics are fundamentally one
and the same.

I. INTRODUCTION

In noncooperative game theory, one has a set of N
players, each choosing its strategy $x_i$ independently, by
sampling a distribution $q_i(x_i)$ over those strategies. Each
player i also has her own utility function $g_i(x)$, specifying
how much reward she gets for every possible joint-strategy
x of all N players. Let $q_{(i)}(x_{(i)})$ mean the joint
probability distribution of all players other than i, i.e.,
$\prod_{j \neq i} q_j(x_j)$. Then the “goal” of each player i is to set $q_i$
so that, conditioned on $q_{(i)}$, the expected value of i’s
utility is as high as possible.

Conventional game theory assumes each player i is 
“fully rational”, able to solve for that optimal $q_i$, and
that she then uses that distribution. It is primarily
concerned with analyzing the resulting equilibria of the game
[3-6]. In the real world, this assumption of full rationality
almost never holds, whether the players are humans,
animals, or computational agents [7-15]. This is due to 
the cost of computation of that optimal distribution, if 
nothing else. This real-world bounded rationality is 
one of the major impediments to applying conventional 
game theory in the real world. 

More generally, consider any scientific scenario, in 
which one wishes to make predictions about a particular 
physical system. To make those predictions it is neces- 
sary to first have some information / data concerning the 
system, to serve as the basis of one’s prediction. With- 
out such information, science can say nothing, and to 
pretend otherwise is erroneous. This is true even when 
the physical system is a set of human players engaged in 
a game: To make any predictions concerning the players, 
one must first be provided (or obtain through observa- 
tion) some information concerning them and the game. 
Together with known scientific laws, only that provided 
information should be used in making one’s prediction. 
So in particular, unless one explicitly is provided the in- 
formation that the players in a game are fully rational, 
to simply assume that they are violates one of the fun- 
damental tenets of how science is done. 


This paper shows how Shannon’s information theory 
[16-18] provides a principled way to modify conventional 
game theory to accommodate bounded rationality. This 
is done by following information theory’s prescription 
that, given only partial knowledge concerning the
distributions the players are using, we should use the
minimum information (Maxent) principle to infer those
distributions. Doing so results in the principle that the
bounded rational equilibrium is the minimizer of a certain
set of coupled Lagrangian functions of the joint distribution,
$q(x) = \prod_i q_i(x_i)$. This mathematical structure
is a special instance of Product Distribution (PD) theory 
[11, 19-24]. 

In addition to showing how to formulate bounded ra- 
tionality, PD theory provides many other advantages to 
game theory. Its formulation of bounded rationality ex- 
plicitly includes a term that, in light of information the- 
ory, is naturally interpreted as a cost of computation. 
PD theory also seamlessly accommodates multiple util- 
ity functions per player. It also provides many powerful 
techniques for finding (bounded rational) equilibria, and 
helps address the issue of multiple equilibria. Another 
advantage is that by changing the coordinates of the un- 
derlying space of joint moves x, the same mathematics 
describes a type of bounded rational cooperative game 
theory, in which the moves of the players are transformed 
into contracts they all offer one another. 

Perhaps the most succinct and principled way of deriv- 
ing statistical physics is as the application of the Maxent 
principle. In this formulation, the problem of statistical 
physics is cast as how best to infer the probability dis- 
tribution over a system’s states when one’s prior knowl- 
edge consists purely of the expectation values of certain 
functions of the system’s state [18, 25]. For example, 
this prescription says we should infer that the probabil- 
ity distribution p governing the system is the Boltzmann 
distribution when our prior knowledge is the system’s 
expected energy. This is known as the “canonical en- 
semble”. Other ensembles arise when other expectation 
values are added to one’s prior knowledge. In particular,
if the number of particles in the system is uncertain,
but one knows its expectation value, one arrives at the
“grand canonical ensemble”.

One major difficulty with working with these ensem- 
bles is that under them the particles of the system are sta- 
tistically coupled with one another. For high-dimensional 
systems, this can make statistical physics calculations 
very difficult. Accordingly, a large body of work has been 
produced under the rubric of Mean Field (MF) theory, in 
which the ensemble is approximated with a distribution 
in which the particles are independent [26]. In an MF ap- 
proximation, a product distribution q governs the joint 
state of the particles — just as a product distribution 
governs the joint strategy of the players in a game. 

MF approximations are usually derived in an ad hoc 
manner. The principled way to derive a MF approxima- 
tion (or any other kind) to a particular ensemble is to 
specify a distance measure saying how close two prob- 
ability distributions are, and then solve for the q that 
is closest to the distribution being approximated, p. To 
do this one needs to specify the distance measure. How 
best to measure distances between probability distribu- 
tions is a topic of ongoing controversy and research [27]. 
The most common way to do so is with the infinite limit 
log likelihood of data being generated by one distribution 
but misattributed to have come from the other. This is 
known as the Kullback-Leibler (KL) distance [16, 17, 28]. 
It is far from being a metric. In particular, it is not sym- 
metric under interchange of the two distributions being 
compared. 

It turns out that the simplest MF theories minimize the 
KL distance from q to p. However it can be argued it is 
the KL distance from p to q that is the most appropriate 
measure, not the KL distance from q to p . Using that 
distance, the optimal q is a new kind of approximation 
not usually considered in statistical physics. 

For the canonical ensemble, the type of KL distance 
arising in simple MF theories turns out to be identical 
to the maxent Lagrangian arising in bounded rational 
game theory. This shows how bounded rational (inde- 
pendent) players are formally identical to the particles in 
the MF approximation to the canonical ensemble. Un- 
der this identification, the moves of the players play the 
roles of the states of the particles, and particle energies 
are translated into player utilities. The coordinate
transformations which in game theory result in cooperative
games are, in statistical physics, techniques for allowing
the canonical ensemble to be more accurately approximated
with a product distribution.

This identification raises the potential of transferring 
some of the powerful mathematical techniques that have 
been developed in the statistical physics community (e.g., 
extensions of mean field theory [26] or cavity methods 
[29]) to noncooperative game theory. It also suggests
translating some of the other ensembles of statistical
physics to game theory, in addition to the canonical
ensemble. As an example, in the grand canonical ensemble
the number of particles is variable, which, after a MF
approximation, corresponds to having a variable number of
players in game theory. Among other applications, this 
provides us with a new framework for analyzing games in 
evolutionary scenarios, different from evolutionary game 
theory. Finally, much work has been done in statisti- 
cal physics on approximations that are higher-order than 
mean-field, introducing extra random variables that al- 
low for some statistical dependencies coupling the vari- 
ables. The associated generalization of PD theory is a 
full-blown theory of Probability Lagrangians. 

In the next section noncooperative game theory and in- 
formation theory are cursorily reviewed. Then bounded 
rational game theory is derived, and its many advantages 
are discussed. The following section starts with a cursory 
review of the information-theoretic derivation of statistical
physics. After that is a discussion of the two kinds of
KL distance and the MF theories they induce, and a dis- 
cussion of coordinate systems. This section also includes 
a discussion on translating a MF version of the grand 
canonical ensemble into a new kind of evolutionary game 
theory. 

Miscellaneous proofs can be found in the appendix. 

As discussed in the physics section, the maxent Lagrangian
and associated Boltzmann solution at the core
of this paper have been investigated for an extremely long
time in the context of many-particle systems. The use 
of the Boltzmann distribution over possible moves also 
has a long history in the Reinforcement Learning (RL) 
literature, i.e., in the design of algorithms for a player in- 
volved in an iterated game with Nature [30, 31]. Related 
work has considered multiple players [32, 33]. In particular,
some of that work has been done in the context
of “mechanism design” of many players, i.e., in the
context of designing the utility functions of the players 
to induce them to maximize social welfare [34-37]. In 
all of this RL work the Boltzmann distribution is usually 
motivated either as an a priori reasonable way to trade 
off exploration and exploitation, as part of a Markov chain
Monte Carlo procedure, or by its asymptotic convergence
properties [38]. 

In addition, independent of the work reported in this 
paper, the maxent Lagrangian and/or the Boltzmann distribution
has previously been mooted as a way to model
human players [10, 39, 40]. Some of that work has ex- 
plicitly noted the relation between the Boltzmann distri- 
bution and statistical physics [41]. However the motiva- 
tion of the maxent Lagrangian and Boltzmann distribu- 
tion in that work is ad hoc , based on particular simple 
models of human decision-making and/or of player inter- 
actions. There is no use of information theory to derive 
the maxent Lagrangian from first principles. Due to this, 
no connection is made in that previous work between the 
maxent Lagrangian and the cost of computation, no ex- 
tension is made to other kinds of prior knowledge con- 
cerning the game, there is no recognition of how to mod- 
ify the Lagrangian for multiple cost functions, there is no 
extension to the grand canonical ensemble and therefore 
variable numbers of players, and there is no development
of rationality operators, or the relation between
semi-coordinate transformations and cooperative game theory.
Ultimately, this lack of theoretical underpinnings is also 
why that previous work did not note the formal iden- 
tity between the game theory of actual bounded rational 
human players and MFT. 

Finally, it’s important to note that PD theory also has 
many applications in science beyond those considered in 
this paper. For example, see [21, 22, 42-44] for work re- 
lating the maxent Lagrangian to distributed control and 
to distributed optimization. See [43] for algorithms for 
speeding up convergence to bounded rational equilibria. 
Some of those algorithms are related to simulated and 
deterministic annealing [28]. In [20] others of those 
algorithms are related to Stackelberg games, and more 
generally to the problem of finding the optimal control 
hierarchy for a team of players with a common goal, i.e.,
finding an optimal organization chart. See also [45-47] 
for work showing, respectively, how to use PD theory to 
improve Metropolis-Hastings sampling, how to relate it 
to the mechanism design work in [34-37], and how to 
extend it to continuous move spaces and time-extended 
strategies. 

II. PD THEORY AS BOUNDED RATIONAL 
NONCOOPERATIVE GAME THEORY 

This section motivates PD theory as a way of address- 
ing several of the shortcomings of conventional noncoop- 
erative game theory. 

A. Review of noncooperative game theory 

In noncooperative game theory one has a set of N
players. Each player i has its own set of allowed pure
strategies. A mixed strategy is a distribution $q_i(x_i)$
over player i’s possible pure strategies. Each player i also
has a utility function $g_i$ that maps the pure strategies
adopted by all N of the players into the real numbers.
So given mixed strategies of all the players, the expected
utility of player i is $E(g_i) = \int dx \prod_j q_j(x_j)\, g_i(x)$ [54].
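
As an illustration of the expected-utility formula above, the following is a minimal Python sketch for finite move sets, where the integral becomes a sum over joint moves. The function name `expected_utility` and the 2x2 payoff table are hypothetical choices made for this example; they do not come from the paper.

```python
import numpy as np

def expected_utility(g_i, mixed_strategies):
    """E(g_i) = sum_x prod_j q_j(x_j) g_i(x) for finite move sets.

    g_i: array of shape (|X_1|, ..., |X_N|) giving player i's utility
         for every joint pure strategy x.
    mixed_strategies: list of N probability vectors, one q_j per player.
    """
    result = np.asarray(g_i, dtype=float)
    # Each tensordot sums out the leading axis, i.e. one player's move,
    # weighted by that player's mixed strategy.
    for q_j in mixed_strategies:
        result = np.tensordot(q_j, result, axes=(0, 0))
    return float(result)

# Hypothetical two-player example (matching-pennies utilities for player 1).
g_1 = np.array([[1.0, -1.0],
                [-1.0, 1.0]])
q = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
value = expected_utility(g_1, q)
```

Because the players are independent, the joint distribution never has to be built explicitly; contracting one axis per player is enough.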

This basic framework can be elaborated to model 
many interactions between biological organisms, and in 
particular between human beings. These interactions 
range from simple abstractions like the famous prisoner’s 
dilemma to iterated games like chess, to international re- 
lations [3, 4, 48]. 

Much of noncooperative game theory is concerned with 
equilibrium concepts specifying what joint-strategy 
one should expect to result from a particular game. In 
particular, in a Nash equilibrium every player adopts 
the mixed strategy that maximizes its expected utility, 
given the mixed strategies of the other players. More 
formally, $\forall i,\ q_i = \arg\max_{q'_i} \int dx\, q'_i(x_i) \prod_{j \neq i} q_j(x_j)\, g_i(x)$.

Several very rich fields have benefited from a close
relationship with noncooperative game theory. Particular
examples are evolutionary game theory (in which the set
of N players is replaced by an infinite set of reproduc- 
ing organisms) and cooperative game theory (in which 
players choose which coalitions of other players to join) 
[6, 49]. Game theory as a whole is also closely related to 
economics, in particular the field of mechanism design, 
which is concerned with how to induce the set of players 
to adopt a socially desirable joint-strategy [3, 50-52].


B. Problems with conventional noncooperative 
game theory 

A number of objections to the Nash equilibrium con- 
cept have been resolved. In particular, it was Nash who 
proved that every game has at least one Nash equilib- 
rium if one expands the realm of discourse to include 
mixed strategies. (The same is not true for pure strate- 
gies.) Other objections have been more or less resolved 
through numerous refinements of the Nash equilibrium 
concept. 

However there are several major problems with the 
concept that are still outstanding. One of them is the 
possible multiplicity of equilibria; this multiplicity means 
the Nash equilibrium concept cannot be used to specify 
the joint strategy that is actually adopted in a real world 
game. (Some refinements of the Nash equilibrium con- 
cept attempt to address this problem, though none has 
succeeded.) Another problem is that while calculating 
Nash equilibria is straightforward in many simple games 
(e.g., 2 players in a zero-sum game), calculating them 
in the general case can be a very difficult computational 
multi-criteria optimization problem. Yet another prob- 
lem is that there is no general way to extend the concept 
to allow each player to have multiple utility functions. 

However perhaps the major problem with the Nash 
equilibrium concept is its assumption of full rationality.
This is the assumption that every player i can both
calculate what the strategies $q_{(i)}$ will be and then
calculate its associated optimal distribution. In other words,
it is the assumption that every player will calculate the
entire joint distribution $q(x) = \prod_j q_j(x_j)$. If for no other
reason than the computational limitations of real humans,
this assumption is essentially untenable. This problem is
just as severe if one allows statistical coupling among the 
players [3, 53]. 

A large body of empirical lore has been generated char- 
acterizing the bounded rationality of humans. Similarly 
much has been learned about the empirical behavior 
of (bounded rational) machine learning computer algo- 
rithms playing games with one another [7, 13]. None of 
this work has resulted in a full mathematical theory of 
bounded rationality however. 

There have also been numerous theoretical attempts 
to incorporate bounded rationality into noncooperative 
game theory by modifying the Nash equilibrium con- 
cept. Some of them assume essentially that every player’s 
mixed strategy is its Nash-optimal strategy with some
form of noise superimposed [6]. Others explicitly model
the humans, typically as computationally limited au- 
tomata, and assume the automata perform optimally 
subject to those computational limitations [10]. Both 
approaches, while providing insight, are very ad hoc as 
models of games involving real-world organisms or real- 
world (i.e., non-trivial) machine learning algorithms. 

The difficulty of calculating equilibria is addressed in 
the sections below on solving for the distributions of PD 
theory. The rest of this section shows how information 
theory can be used to extend game theory to avoid its 
other shortcomings. Finally, the sections after this one 
present some other extensions of game theory, in partic- 
ular to allow for a variable number of players. (Games 
with variable number of players arise in many biological 
scenarios as well as economic ones.) 

C. Review of the minimum information principle 

Shannon was the first person to realize that based 
on any of several separate sets of very simple desider- 
ata, there is a unique real-valued quantification of the 
amount of syntactic information in a distribution P(y). 
He showed that this amount of information is (the negative
of) the Shannon entropy of that distribution,
$S(P) = -\int dy\, P(y) \ln[P(y)]$ [55].

So for example, the distribution with minimal infor- 
mation is the one that doesn’t distinguish at all between 
the various y, i.e., the uniform distribution. Conversely, 
the most informative distribution is the one that specifies 
a single possible y. Note that for a product distribution,
entropy is additive, i.e., $S(\prod_i q_i(y_i)) = \sum_i S(q_i)$.
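
The additivity property can be checked numerically. The sketch below assumes discrete distributions and the plain Shannon entropy with natural logarithm; the function name `entropy` and the two example distributions are illustrative assumptions, not part of the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum_y p(y) ln p(y), with 0 ln 0 taken as 0."""
    p = np.asarray(p, dtype=float).ravel()
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

# Two hypothetical independent marginals and their product distribution.
q1 = np.array([0.2, 0.8])
q2 = np.array([0.1, 0.3, 0.6])
joint = np.outer(q1, q2)          # joint[a, b] = q1[a] * q2[b]

entropy_gap = entropy(joint) - (entropy(q1) + entropy(q2))
```

For a product distribution `entropy_gap` vanishes up to floating-point error; for a joint distribution with statistical coupling it would instead be negative (minus the mutual information).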

Say we are given some incomplete prior knowledge about a
distribution P(y). How should one estimate P(y) based 
on that prior knowledge? Shannon’s result tells us how to 
do that in the most conservative way: have your estimate 
of P(y) contain the minimal amount of extra information 
beyond that already contained in the prior knowledge 
about P(y). Intuitively, this can be viewed as a version 
of Occam’s razor. This approach is called the minimum 
information (or “maxent”) principle. It has proven ex- 
tremely useful in domains ranging from signal processing 
to image processing to supervised learning [17]. 

D. Maxent Lagrangians 

Much of the work on equilibrium concepts in game the- 
ory adopts the perspective of an external observer of a 
game. We are told something concerning the game, e.g., 
the move sets and utility functions of the separate players,
information sets, etc., and from that wish to predict
what joint strategy will be followed by real-world players 
of the game. Say that in addition to such information, 
we are told the expected utilities of the players. What 
is our best estimate of the distribution q that generated 
those expected utility values? By the maxent principle,
it is the distribution with maximal entropy, subject to
those expectation values.

To formalize this, for simplicity assume a finite number 
of players, and a finite number of possible moves (pure 
strategies) for each player. To agree with the convention 
in other fields, from now on we implicitly flip the sign of 
each $g_i$ so that the associated player i wants to minimize
that function rather than maximize it. Intuitively, this
flipped $g_i(x)$ is the “cost” to player i when the
joint-strategy is x, rather than its utility then.

So our prior knowledge is that the players are independent,
that their cost functions are the $\{g_i\}$, and that
their expected costs are given by the set of values $\{\epsilon_i\}$.
The maxent estimate of the q for that prior knowledge is
given by the minimizer of the Lagrangian

$$\mathscr{L}(q) \equiv \sum_i \beta_i \left[ E_q(g_i) - \epsilon_i \right] - S(q)$$

$$= \sum_i \beta_i \left[ \int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i \right] - S(q) \qquad (1)$$

where the subscript on the expectation value indicates
that it is evaluated under distribution q, and the $\{\beta_i\}$ are
Lagrange parameters implicitly set by the constraints on
the expected costs [56].

Solving, we find that the mixed strategies minimizing
the Lagrangian are related to each other via

$$q_i(x_i) \propto e^{-E_{q_{(i)}}(G \mid x_i)} \qquad (2)$$

where the overall proportionality constant for each i is set
by normalization, $G \equiv \sum_i \beta_i g_i$, and the subscript
$q_{(i)}$ on the expectation value indicates that it is evaluated
according to the distribution $\prod_{j \neq i} q_j$. In Eq. 2 the
probability of player i choosing pure strategy $x_i$ depends
on the effect of that choice on the utilities of the other
players. This reflects the fact that our prior knowledge
concerns all the players equally.

If we wish to focus only on the behavior of player i, 
it is appropriate to modify our prior knowledge. To see 
how to do this, first consider the case of maximal prior 
knowledge, in which we know the actual joint-strategy of 
the players, and therefore all of their expected costs. For 
this case, trivially, the maxent principle says we should 
“estimate” q as that joint-strategy (it being the q with 
maximal entropy that is consistent with our prior knowl- 
edge). The same conclusion holds if our prior knowledge 
also includes the expected cost of player i. 

Now modify this maximal set of prior knowledge by
removing from it specification of player i’s strategy. So
our prior knowledge is the mixed strategies of all players
other than i, together with player i’s expected cost. We
can incorporate the prior knowledge of the other players’
mixed strategies directly into our Lagrangian, without
introducing Lagrange parameters. That maxent Lagrangian is

$$\mathscr{L}_i(q_i) \equiv \beta_i \left[ E(g_i) - \epsilon_i \right] - S_i(q_i)$$

$$= \beta_i \left[ \int dx \prod_j q_j(x_j)\, g_i(x) - \epsilon_i \right] - S_i(q_i).$$

All of these Lagrangians (one for each i) are jointly solved
at a q given by a set of coupled Boltzmann distributions:

$$q_i(x_i) \propto e^{-\beta_i E_{q_{(i)}}(g_i \mid x_i)} \qquad (3)$$

where the $\{\beta_i\}$ are Lagrange parameters enforcing our
constraints in the usual way. Following Nash, we can use
Brouwer’s fixed point theorem to establish that for any
fixed set of non-negative values $\{\beta_i\}$, there must exist at
least one product distribution given by the product of
these Boltzmann distributions (one term in the product
for each i).
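
To make the coupled structure of Eq. 3 concrete, here is a minimal Python sketch that searches for a self-consistent pair of Boltzmann distributions in a two-player game by damped fixed-point iteration. The iteration scheme, the function names, and the step size are assumptions made for illustration; the existence argument in the text does not guarantee that this particular iteration converges for every game, and it is not presented as the paper's own solution method.

```python
import numpy as np

def boltzmann(costs, beta):
    """q(x) proportional to exp(-beta * costs[x]), computed stably."""
    weights = np.exp(-beta * (costs - costs.min()))
    return weights / weights.sum()

def bounded_rational_equilibrium(g1, g2, beta1, beta2, iters=5000, step=0.1):
    """Damped iteration toward a solution of Eq. 3 for two players.

    g1[a, b], g2[a, b]: costs to players 1 and 2 at joint move (a, b).
    """
    n1, n2 = g1.shape
    q1 = np.full(n1, 1.0 / n1)
    q2 = np.full(n2, 1.0 / n2)
    for _ in range(iters):
        # Conditional expected costs E_{q_(i)}(g_i | x_i) for each own move.
        c1 = g1 @ q2
        c2 = q1 @ g2
        # Move each mixed strategy part of the way toward its Boltzmann response.
        q1 = (1 - step) * q1 + step * boltzmann(c1, beta1)
        q2 = (1 - step) * q2 + step * boltzmann(c2, beta2)
    return q1, q2
```

At a fixed point of this iteration each $q_i$ equals the Boltzmann distribution over its own conditional expected costs, which is exactly the self-consistency condition of Eq. 3.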

The first term in $\mathscr{L}_i$ is minimized by a perfectly rational
player. The second term is minimized by a perfectly
irrational player, i.e., by a perfectly uniform mixed strategy
$q_i$. So $\beta_i$ in the maxent Lagrangian explicitly specifies
the balance between the rational and irrational behavior
of the player. In particular, for $\beta \to \infty$, by minimizing
the Lagrangians we recover the Nash equilibria of the
game. More formally, in that limit the set of q that
simultaneously minimize the Lagrangians is the same as
the set of delta functions about the Nash equilibria of
the game. The same is true for Eq. 2.

The $\beta < \infty$ solutions of Eq. 3 can also be viewed as
“equilibria” in the conventional game theory sense, of
being a self-consistent set of mixed strategies of the players.
To see this, posit that for each player there is a rule
(implicit or otherwise) for how it sets its mixed strategy, 
a rule based on the expected costs of each of that player’s 
pure strategies. Say that each player’s rule takes the form 
of a Boltzmann distribution over those expected costs for 
each of the player’s possible pure strategies. (Such a rule 
may reflect cost of computation (see below), desire by 
the player to explore as well as exploit, inherent psycho- 
logical biases, etc.) Then the system is in a bounded 
rational equilibrium for a joint mixed strategy where all 
the players follow their separate rules in a globally con- 
sistent manner. 

Eq. 2 is just a special case of Eq. 3, where all players
share the same cost function G. (Such games are known 
as team games.) Due to this, our guarantee of the 
existence of a solution to the set of maxent Lagrangians 
implies the existence of a solution of the form Eq. 2. 

Typically players aren’t close to perfectly self- 
defeating. Almost always they will be closer to min- 
imizing their expected cost than maximizing it. For 
prior knowledge consistent with such a case, the $\beta_i$ are
all non-negative. Examples of games and their associ- 
ated bounded rational equilibria can be found below in 
Sec. II K, after the discussion of rationality operators. 

Finally, our prior knowledge often will not consist of 
exact specification of the expected costs of the players,
even if that knowledge arises from watching the players
make their moves. Such other kinds of prior knowledge
are addressed in several of the following subsections.

E. Alternative interpretations of Lagrangians 

There are numerous alternative interpretations of these 
results. For example, change our prior knowledge to be 
the entropy of each player i’s strategy, i.e., how unsure
it is of what move to make. Now we cannot use information
theory to make our estimate of q. Given that players
try to minimize expected cost, a reasonable alternative
is to predict that each player i’s expected cost will be as
small as possible, subject to that provided value of the
entropy and the other players’ strategies. The associated
Lagrangians are $\alpha_i[S(q_i) - \sigma_i] - E(g_i)$, where $\sigma_i$ is the
provided entropy value. This is equivalent to the max- 
ent Lagrangian, and in particular has the same solution, 
Eq. 3. 

Another alternative interpretation involves world 
cost functions, which are quantifications of the qual- 
ity of a joint pure strategy x from the point of view 
of an external observer (e.g., a system designer, the 
government, an auctioneer, etc.). A particular class 
of world cost functions are “social welfare functions”, 
which can be expressed in terms of the cost functions 
of the individual players. Perhaps the simplest example 
is $G(x) = \sum_i \beta_i g_i(x)$, where the $\beta_i$ serve to trade off how
much we value one player’s cost vs. another’s. If we know
the value of this social welfare function, but nothing else, 
then maxent tells us to minimize the Lagrangian of Eq. 1. 

An important aspect of any of these interpretations is 
that typically one does not have to explicitly specify the 
values in one’s “prior knowledge”. This is because typically
the Lagrange parameters are monotonic functions
of those “prior knowledge” values [43]. So it suffices to 
specify the values of the Lagrange parameters; the ex- 
pected value “prior knowledge” is purely nominal. This 
is formalized in the subsection on rationality operators, 
where the prior knowledge is explicitly formulated as the 
values of Lagrange parameters. 

F. Bounded rational game theory 

In many situations we have prior knowledge different 
from (or in addition to) expected values of cost functions. 
This is particularly true when the players are human be- 
ings (so that behavioral economics studies can be brought 
to bear) or simple computational algorithms. To apply 
information theory in such situations, we simply need to 
incorporate that prior knowledge into our Lagrangian(s). 

To give a simple example, say that we know that the 
players all want to ensure not just a low expected cost, 
but also that the actual cost doesn’t vary too much from 
one sample of q to the next. We can formalize this by saying
that in addition to expected costs, our prior knowledge
includes variances in the costs. Given the expected
values of the costs, such variances are specified by the
expected values of the squares of the cost. Accordingly, all
our prior knowledge is in the form of expectation values.
Modifying Eq. 3 appropriately, we arrive at the solution

$$q_i(x_i) \propto e^{-E_{q_{(i)}}\left(\alpha_i (g_i - \lambda_i)^2 \mid x_i\right)} \qquad (4)$$

where the Lagrange parameters $\alpha_i$ and $\lambda_i$ are given by
the provided expectations and variances of the costs of
the players.

Eq. 4 is our best guess for what the actual mixed strat- 
egy of player i is, in light of our prior knowledge concern- 
ing that player. Note that this formula directly reflects 
the fact that player i does not care only about minimiz- 
ing cost, i.e., maximizing utility. In this, we are directly 
incorporating the possibility that the player violates the 
axioms of utility theory — something never allowed in 
conventional game theory. Other behavioral economics 
phenomena like risk aversion can be treated in a similar 
fashion. 

A variant of this scenario would have our prior knowl- 
edge only give the variances of the costs of the players 
and not their expected costs. In this case the Lagrangian
must involve a term quadratic in q, in addition to the 
entropy term and a term linear in q. (See the subsection 
on multiple cost functions.) More generally, our prior 
knowledge can be any nonlinear function of q. In addition,
even if we stick to prior knowledge that is linear in q,
that knowledge can couple the cost functions of the players.
For example, if we know that the expected difference
in cost of players i and j is $\epsilon$, the associated Lagrange
constraint term is $\int dx\, q(x)[g_i(x) - g_j(x) - \epsilon]$. In this
situation our prior knowledge couples the strategies of the
players, even though those players are independent. See 
the discussions on constrained optimization in [21, 23]. 

G. Cost of computation 

As mentioned above, bounded rationality is an un- 
avoidable consequence of the cost of computation to 
player i of finding its optimal strategy. Unfortunately, 
one cannot simply incorporate that cost into $g_i$, and then
presume that the player acts perfectly rationally for this
new $g_i$. The reason is that this cost is associated with the
entire distribution $q_i(x_i)$ that player i calculates; it is not
associated with some particular joint-strategy formed by
sampling such a distribution.

How might we quantify the cost of calculating $q_i$? The
natural approach is to use information theory. Indeed, 
that cost arises naturally in the bounded rationality for- 
mulation of game theory presented above. To see how, 
for each player i define 

$$f_i(x, q_i(x_i)) \equiv \beta_i g_i(x) + \ln[q_i(x_i)]. \qquad (5)$$

Then we can write the maxent Lagrangian for player i as

$$\mathscr{L}_i(q) = \int dx\, q(x)\, f_i(x, q_i(x_i)). \qquad (6)$$


Now in a bounded rational game every player sets its 
strategy to minimize its Lagrangian, given the strategies 
of the other players. In light of Eq. 6, this means that we 
can interpret each player in a bounded rational game as 
being perfectly rational for a cost function that incorpo- 
rates its computational cost. To do so we simply need to 
expand the domain of “cost functions” to include proba- 
bility values as well as joint moves. 
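
The rewriting in Eq. 6 can be checked numerically for a small two-player game. The sketch below assumes strictly positive mixed strategies and compares the expectation of the extended cost function $f_i$ with $\beta_i E(g_i) - S(q_i)$, i.e. the maxent Lagrangian without the constant $\beta_i \epsilon_i$ term, which does not affect the minimizer. The function names and example numbers are hypothetical.

```python
import numpy as np

def lagrangian_via_f(g_i, q1, q2, beta_i, player):
    """Expectation under q = q1 x q2 of f_i = beta_i * g_i(x) + ln q_i(x_i)."""
    q = np.outer(q1, q2)
    q_i = q1 if player == 0 else q2
    log_qi = np.log(q_i)
    # Broadcast ln q_i(x_i) along player i's own axis of the joint move table.
    log_term = log_qi[:, None] if player == 0 else log_qi[None, :]
    f = beta_i * g_i + log_term
    return float(np.sum(q * f))

def lagrangian_direct(g_i, q1, q2, beta_i, player):
    """beta_i * E(g_i) - S(q_i), computed term by term."""
    q = np.outer(q1, q2)
    q_i = q1 if player == 0 else q2
    expected_cost = np.sum(q * g_i)
    entropy_i = -np.sum(q_i * np.log(q_i))
    return float(beta_i * expected_cost - entropy_i)
```

The two functions agree because summing $q(x) \ln q_i(x_i)$ over the other players' moves leaves $\sum_{x_i} q_i(x_i) \ln q_i(x_i) = -S(q_i)$.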

Similar results hold for non-maxent Lagrangians. All
that is needed is that we can write such a Lagrangian in
the form of Eq. 6 for some appropriate function $f_i$.


H. Shape of the Lagrangian surface 

In this subsection we consider $\mathcal{L}_i$ as a function of $q$,
with $\beta_i$ and $\epsilon_i$ both treated as fixed parameters. (So in
particular, $E_q(g_i)$ need not equal $\epsilon_i$.)

First, say that $q_{(i)}$ is held fixed, with only $q_i$ allowed
to vary. This makes $E(g_i)$ linear in $q_i$. In addition,
entropy is a concave function, and the unit simplex is a
convex region. Accordingly, the Lagrangian of Eq. 3 has
a unique local minimum over $q_i$. So there is no issue of
choosing among multiple minima when all of $q_{(i)}$ is fixed.
Nor is there any problem of “getting trapped in a local
minimum” in a computational search for that minimum.
Indeed, in this situation we can just jump directly to that
global optimum, via Eq. 3. All of this is also true if we
are considering the Lagrangian $\mathcal{L}_j$ rather than $\mathcal{L}_i$; the
function from $i$’s strategy to $j$’s Lagrangian has a single
optimum, interior to $i$’s simplex.

Now introduce the shorthand

$$[U]_{i,p}(x_i) \equiv \int dx_{(i)}\, U(x_i, x_{(i)})\, p(x_{(i)} \mid x_i),$$

so that $[g_i]_{i,q_{(i)}}(x_i)$ is player $i$’s effective cost function,
$E(g_i \mid x_i)$. Consider the value $E_{q_i^{\beta_i}}([g_i]_{i,q_{(i)}})$. This
is the value of $E(g_i)$ at $i$’s bounded rational equilibrium
for the fixed $q_{(i)}$, i.e., it is the value at the minimum
over $q_i$ of $\mathcal{L}_i$. View that value as a function of
$\beta_i$. One can show that this is a decreasing function. In
fact, its derivative just equals the negative of the variance
of $[g_i]_{i,q_{(i)}}(x_i)$ evaluated under distribution $q_i^{\beta_i}(x_i)$. Since
$E(g_i)$ is bounded below (for bounded $g_i$), this means that
the variance must go to zero for large enough $\beta_i$. So
as $\beta_i$ grows, $q_i^{\beta_i}(x_i) \to 0$ for all $x_i$ that don’t minimize
$E_{q_{(i)}}(g_i \mid x_i)$. In other words, in that limit, $q_i$ becomes
Nash-optimal.
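
The derivative claim can be checked numerically. The sketch below treats a fixed list of effective costs $E(g_i \mid x_i)$ as given (the values are made up for illustration), forms the Boltzmann distribution at rationality $\beta$, and compares a finite-difference derivative of the expected cost with the negative of the variance. The helper names are assumptions of this sketch.

```python
import math

def boltzmann(costs, beta):
    """Distribution proportional to exp(-beta * cost), shifted for stability."""
    shift = min(costs)
    weights = [math.exp(-beta * (c - shift)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def expected_cost(costs, beta):
    q = boltzmann(costs, beta)
    return sum(p * c for p, c in zip(q, costs))

def variance(costs, beta):
    q = boltzmann(costs, beta)
    mean = sum(p * c for p, c in zip(q, costs))
    return sum(p * (c - mean) ** 2 for p, c in zip(q, costs))

# Hypothetical effective costs of player i's three moves.
effective_costs = [1.0, 2.0, 4.0]
beta = 0.8
h = 1e-5

# Central finite difference of the expected cost with respect to beta.
numeric_derivative = (expected_cost(effective_costs, beta + h)
                      - expected_cost(effective_costs, beta - h)) / (2 * h)
print(numeric_derivative, -variance(effective_costs, beta))

# For large beta the expected cost approaches the minimal effective cost.
print(expected_cost(effective_costs, 50.0))
```

In this example the expected cost at large $\beta$ approaches the smallest entry of the cost list, which is the Nash-optimal limit described above.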

Next consider varying over all $q \in \mathcal{Q}$, the space of all
product distributions $q$. This is a convex space; if $p \in \mathcal{Q}$
and $p' \in \mathcal{Q}$, then so is any distribution on the line
connecting $p$ and $p'$. However over this space, the $E(g_i)$ term
in $\mathcal{L}_i$ is multilinear. So $\mathcal{L}_i$ is not a simple convex
function of $q$. This is true even for a team game, with shared
$\beta_i$, for which case every $i$ has the same Lagrangian. So
we do not have the guarantees of a single local minimum
provided by convexity even in this case.

To further analyze the shape of the team game Lagrangian
as a function of $q$, we start with the following
lemma, which extends the technique of Lagrange parameters
to off-equilibrium points:

Lemma 1: Consider the set of all vectors leading from
$x' \in \mathbb{R}^n$ that are, to first order, consistent with a set of
constraints $\{f_i(x) = 0\}$ over $\mathbb{R}^n$. Of those vectors, the one giving the
steepest ascent of a function $V(x)$ is $u = \nabla V + \sum_i \lambda_i \nabla f_i$,
up to an overall proportionality constant, where the $\lambda_i$
enforce the first order consistency conditions, $u \cdot \nabla f_i = 0 \;\forall i$.
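
For a single constraint the lemma reduces to subtracting from $\nabla V$ its component along $\nabla f$. The sketch below does this for the normalization constraint of one player's distribution, $f(x) = \sum_k x_k - 1$, whose gradient is the all-ones vector. The function name and the example gradient are assumptions made for illustration; several constraints would require solving a small linear system for the $\lambda_i$ instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def constrained_ascent_direction(grad_v, grad_f):
    """u = grad_v + lam * grad_f, with lam chosen so that u . grad_f = 0."""
    lam = -dot(grad_v, grad_f) / dot(grad_f, grad_f)
    u = [gv + lam * gf for gv, gf in zip(grad_v, grad_f)]
    return u, lam

# Hypothetical gradient of V at the current point.
grad_v = [3.0, 1.0, -1.0]
# Gradient of the normalization constraint sum_k x_k = 1.
grad_f = [1.0, 1.0, 1.0]

u, lam = constrained_ascent_direction(grad_v, grad_f)
print(u, lam)
```

Moving along `u` keeps the components summing to one to first order, since `u` is orthogonal to the constraint gradient.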

Note that the gradient of entropy is infinite at the border
of $\mathcal{Q}$, since at least one $\ln(q_i)$ term will be negative
infinite there. Combined with Lemma 1, this can be used
to establish that at the edge of $\mathcal{Q}$, the steepest descent
direction of any player’s Lagrangian points into the interior
of $\mathcal{Q}$ (assuming finite $\beta$ and $g_i$). (This is reflected
in the equilibrium solutions Eq. 3.) Accordingly, whereas
Nash equilibria can be on the edge of $\mathcal{Q}$ (e.g., for a pure
strategy Nash equilibrium), in bounded rational games
any equilibrium must lie in the interior of $\mathcal{Q}$. In other
words, any equilibrium (i.e., any local minimum) of a
bounded rational game has non-zero probability for all
joint moves. So just as when only varying a single $q_i$,
we never have to consider extremal mixed strategies in
searching for equilibria over all $\mathcal{Q}$. We can use local
descent schemes instead [21, 23, 43].

Lemma 1 can also be used to construct examples of
games with more than one bounded rational equilibrium
(just as there are games with more than one Nash
equilibrium). One can also show that for every player $i$ and
any point $q$ interior to $\mathcal{Q}$, there are directions in $\mathcal{Q}$ along
which $i$’s Lagrangian is locally convex. Accordingly, no
player’s Lagrangian has a local maximum interior to $\mathcal{Q}$.
So if there are multiple local minima of $i$’s Lagrangian,
they are separated by saddle points across ridges. In
addition, the uniform $q$ is a solution to the set of coupled
equations Eq. 3 for a team game, but typically is not a
local minimum, and therefore must be a saddle point.

Say we modify the Lagrangians to be defined for all
possible $p$, not just those that are product distributions.
For example the Lagrangian of Eq. 1 becomes

$$\mathcal{L}(p) = \sum_i \beta_i \left[ \int dx\, g_i(x) p(x) - \epsilon_i \right] - S(p).$$

The first term in this Lagrangian is linear in $p$. Since
entropy is a concave function of the Euclidean vector $p$ over
the unit simplex, this means that the overall Lagrangian
is a convex function of $p$ over the space of allowed $p$. This
means there is a unique minimum of the Lagrangian over
the space of all possible legal $p$. Furthermore, as mentioned
previously, for finite $\beta$ at least one of the derivatives
of the Lagrangian is negative infinite at the border
of the allowed region of $p$. This means that the unique
minimum of the Lagrangian is interior to that region, i.e.,
is a legal probability distribution.


In general this optimal p will not be a product dis- 
tribution, of course. Rather the strategy choices of the 
players are typically statistically coupled, under this p. 
Such coupling is very suggestive of various stochastic for- 
mulations of noncooperative game theory. Coupling also 
arises in cooperative game theory, in which binding
contracts couple the moves of the players [6, 48].

Similarly, as is proven in the appendix, the Lagrangian
$\mathcal{L}(p) = \sum_i \beta_i [E_p(g_i)]^2 - S(p)$ is convex over the manifold
of legal $p$, assuming non-negative $\beta_i$. So the model of
mechanism design introduced in Sec. II I has a unique
equilibrium, if we allow the players to be statistically
coupled.


I. Multiple cost functions per player 

Say player $i$ has several different cost functions $\{g_i^j\}$
and wants to choose a strategy that will do well at all of
them. In the case of pure strategies we can simply “roll
up” the cost functions into an aggregate function and
employ that in a conventional, single-cost-function-per-player
game theoretic analysis. An aggregate cost function
like $\sum_j g_i^j(x)$ would not necessarily work, since it may
be that the pure strategy $x$ minimizing that sum results
in a relatively large value for one of the $g_i^j(x)$. However by
construction, minimizing a function like $\max_j g_i^j(x)$ will
ensure that no particular cost function is favored over the
others. Player $i$ will perform well according to such an
aggregate function iff it performs well according to all of
the constituent $g_i^j$.
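
A small made-up example shows how the two aggregates can disagree over pure strategies. The cost tables below are hypothetical; each list gives one cost function's value for three pure strategies.

```python
# Two hypothetical cost functions of player i, indexed by pure strategy.
cost_a = [0.0, 3.0, 4.0]
cost_b = [6.0, 4.0, 4.5]
strategies = range(len(cost_a))

# Aggregate by summing, and aggregate by taking the worst case.
sum_aggregate = [cost_a[x] + cost_b[x] for x in strategies]
max_aggregate = [max(cost_a[x], cost_b[x]) for x in strategies]

best_by_sum = min(strategies, key=lambda x: sum_aggregate[x])
best_by_max = min(strategies, key=lambda x: max_aggregate[x])

print(best_by_sum, cost_b[best_by_sum])  # sum picks a strategy with a large cost_b
print(best_by_max, max_aggregate[best_by_max])
```

Here the sum is minimized by a strategy whose second cost is the largest in the table, while the max aggregate picks a strategy that keeps both costs moderate.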

One might think that for mixed strategies one could
similarly roll up the cost functions and say that player
$i$ works to minimize an aggregate cost function. However
especially when player $i$ has many cost functions,
it may be that performance according to one or more of
the constituent cost functions is quite bad even though
the performance according to this aggregate function is
good. In particular, it may be that player $i$ has relatively
low value of the expectation of the maximum of
its cost functions, even though the maximum of the expected
costs is quite high [57]. More generally, we cannot
ensure that the expected costs of player $i$,
$E_q(g_i^j) = \int dx\, g_i^j(x)\, q_i(x_i)\, q_{(i)}(x_{(i)})$, all have good values by
appropriately defining an aggregate $g_i$ and requiring only that
$\int dx\, g_i(x)\, q_i(x_i)\, q_{(i)}(x_{(i)})$ is good. Instead, we must redefine
the goal of “minimizing expected costs”.

One way to reformulate our goal proceeds by analogy
with the goal typically ascribed to a player in pure strategy
games. This analogy is based on viewing the cost
function for player $i$ as controlled by a fictional player
in a meta-game. Conventional game theory analyzes the
case where player $i$ chooses a pure strategy to minimize
the worst case (over other players’ moves) cost to $i$, i.e.,
to minimize $\max_{x_{(i)}} g_i(x_i, x_{(i)})$. Here the analogy would
be for the player to choose a mixed strategy to minimize
the worst case (over moves by the fictional player)
expected cost, i.e., to minimize $\max_j E_q(g_i^j)$.

A similar solution, appropriate when all of the cost
functions are nowhere-negative, is for player $i$ to minimize
$\sum_j [E_q(g_i^j)]^2$. Due to the convexity of the squaring
operator such minimization will help ensure that no single
expectation value $E_q(g_i^j)$ is too high [58]. Indeed,
consider increasing the power we raise the costs to, getting
the function $\left[\sum_j [E_q(g_i^j)]^n\right]^{1/n}$. Minimizing this for
large $n$ will approximate the lim-sup norm, which would
force all $g_i^j$ to have the same (as low as possible)
expectation value.
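
The approach of the power aggregate to the largest expected cost can be seen numerically. The sketch below uses a hypothetical list of nowhere-negative expected costs; the function name is an assumption of this sketch, and the values are scaled by their maximum before being raised to the power $n$ to avoid overflow.

```python
def power_aggregate(values, n):
    """[sum_j values_j**n]**(1/n) for non-negative values with a positive maximum."""
    top = max(values)
    return top * sum((v / top) ** n for v in values) ** (1.0 / n)

# Hypothetical expected costs E_q(g_i^j) for three cost functions.
expected_costs = [1.0, 2.0, 4.0]

aggregates = {n: power_aggregate(expected_costs, n) for n in (2, 8, 50)}
for n, value in aggregates.items():
    print(n, value)
```

As $n$ grows the aggregate decreases toward the largest entry, so minimizing it increasingly penalizes whichever expected cost is currently worst.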

As far as the math is concerned, $\sum_j [E_q(g_i^j)]^2$ is just
a “Lagrangian” of $q$, one that is convex like the Lagrangian
in Eq. 3. If we wish, we can modify such a
Lagrangian to incorporate bounded rationality, to force
the solution to be interior to $\mathcal{Q}$, getting Lagrangians like
$\sum_j \beta_j [E_q(g_i^j)]^2 - S(q_i)$, where the $\beta_j$ determine the relative
rationalities of player $i$ according to its various cost
functions.

These kinds of Lagrangians can also model the pro- 
cess of mechanism design, where there is an external 
designer who induces the players to adopt a desirable 
joint-strategy [3]. As an example, “desirable” sometimes 
means that no single player’s expected cost is high. A 
system that meets this goal fairly well can be modeled 
with a Lagrangian involving terms like $\sum_i [E_q(g_i)]^2$.

J. Rationality operators 

Often our prior knowledge will not concern expected 
costs. In particular, this is usually true if our prior knowl- 
edge is provided to us before the game is played, rather 
than afterward. In such a situation, prior knowledge will 
more likely concern the “intelligences” of the players, i.e., 
how close they are to being rational. In particular, if
we want our prior knowledge concerning player $i$ to be
relatively independent of what the other players do, we
cannot use $i$’s expected cost as our prior knowledge. Our
prior knowledge will often concern how peaked $i$’s mixed
strategy is about whichever of its moves minimize its cost
(or how peaked we can assume it to be), not the associated
minimal cost values.

Formally, the problem faced by player $i$ is how to set
its mixed strategy $q_i(x_i)$ so as to minimize the expected
value of its effective cost function, $E(g_i \mid x_i)$. Generalizing,
what we want is a rationality operator $R(U, p)$ that
measures how peaked an arbitrary distribution $p(y)$ is
about the minimizers of an arbitrary cost function $U(y)$,
$\mathrm{argmin}_y U(y)$.

Formally, we make two requirements of $R$:

1. If $p(y) \propto e^{-\beta U(y)}$ for non-negative $\beta$, then it is
natural to require that the peakedness of the distribution
(its rationality value) is $\beta$.

2. We also need to specify something of $R(U, p)$’s
behavior for non-Boltzmann $p$. It will suffice to
require that of the $p$ satisfying $R(U, p) = \beta$, the
one that has maximal entropy is proportional to
$e^{-\beta U(y)}$. In other words, we require that the Boltzmann
distribution maximizes entropy subject to a
provided value of the rationality operator.

As an illustration, a natural choice for $R(U, p)$ would be
the $\beta$ of the Boltzmann distribution that “best fits” $p$.
Information theory provides us such a measure for how
well a distribution $p_1$ is fit by a distribution $p_2$. This is
the Kullback-Leibler distance [16, 28]:

$$KL(p_1 \,\|\, p_2) \equiv S(p_1 \,\|\, p_2) - S(p_1) \qquad (8)$$

where $S(p_1 \,\|\, p_2) \equiv -\int dy\, p_1(y) \ln\left[\frac{p_2(y)}{\mu(y)}\right]$ is known as
the cross entropy from $p_1$ to $p_2$ (and as usual we implicitly
choose uniform $\mu$). The KL distance is always
non-negative, and equals zero iff its two arguments are
identical.
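
For discrete distributions with uniform $\mu$ (taken here as $\mu = 1$, an assumption of this sketch), Eq. 8 can be evaluated directly. The function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy S(p) = -sum p ln p."""
    return -sum(x * math.log(x) for x in p if x > 0)

def cross_entropy(p1, p2):
    """Cross entropy S(p1 || p2) = -sum p1 ln p2."""
    return -sum(a * math.log(b) for a, b in zip(p1, p2) if a > 0)

def kl_distance(p1, p2):
    """KL(p1 || p2) = S(p1 || p2) - S(p1), as in Eq. 8."""
    return cross_entropy(p1, p2) - entropy(p1)

p1 = [1 / 2, 1 / 4, 1 / 4]
p2 = [2 / 3, 1 / 4, 1 / 12]

print(kl_distance(p1, p2))  # strictly positive, since p1 != p2
print(kl_distance(p1, p1))  # zero up to rounding
```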

Define $N(U) \equiv \int dy\, e^{-U(y)}$, the normalization
constant for the distribution proportional to $e^{-U(y)}$.
(This is called the partition function in statistical
physics.) Then using the KL distance, we arrive at the
rationality operator

$$R_{KL}(U, p) \equiv \mathrm{argmin}_\beta\, KL\left(p \,\Big\|\, \frac{e^{-\beta U}}{N(\beta U)}\right) = \mathrm{argmin}_\beta \left[\beta \int dy\, p(y) U(y) + \ln(N(\beta U))\right].$$

In the appendix it is proven that $R_{KL}$ respects the two
requirements of rationality operators.
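
A minimal numerical sketch of $R_{KL}$ for a discrete cost function follows. Setting the $\beta$-derivative of the bracketed expression to zero gives the condition that the expected cost under $p$ equals the expected cost under the Boltzmann distribution at $\beta$; since the latter decreases with $\beta$, bisection finds the solution. The search interval $[0, \beta_{max}]$, the function names, and the cost values are assumptions of this sketch, not part of the paper.

```python
import math

def boltzmann(costs, beta):
    """Distribution proportional to exp(-beta * U(y)), shifted for stability."""
    shift = min(costs)
    weights = [math.exp(-beta * (c - shift)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def expected(costs, p):
    return sum(pi * c for pi, c in zip(p, costs))

def kl_rationality(costs, p, beta_max=50.0, steps=200):
    """Bisection for the beta whose Boltzmann expected cost matches that of p."""
    target = expected(costs, p)
    lo, hi = 0.0, beta_max
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        # The Boltzmann expected cost decreases as beta grows.
        if expected(costs, boltzmann(costs, mid)) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

U = [0.0, 1.0, 3.0]            # hypothetical cost function
p = boltzmann(U, 1.5)          # a Boltzmann distribution with beta = 1.5
rationality = kl_rationality(U, p)
print(rationality)
```

Applied to a distribution that is itself Boltzmann, the sketch recovers the $\beta$ used to build it, which is the first requirement on a rationality operator.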

The quantity $\ln(N(\beta U))$ appearing in the second equation,
when scaled by $-\beta^{-1}$, is called the free energy. It
is easy to verify that it equals the Lagrangian $E_p(U) - S(p)/\beta$
if $p$ is given by the Boltzmann distribution $p(y) \propto e^{-\beta U(y)}$.
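
The verification is short enough to spell out. The following derivation is a sketch that uses only the definitions of $N$ and of the entropy with uniform $\mu$; note the overall minus sign relating the free energy to $\ln N(\beta U)$.

```latex
% Boltzmann distribution and its entropy
p(y) = \frac{e^{-\beta U(y)}}{N(\beta U)}
\quad\Longrightarrow\quad
S(p) = -\int dy\, p(y) \ln p(y)
     = \beta \int dy\, p(y) U(y) + \ln N(\beta U)
     = \beta E_p(U) + \ln N(\beta U).

% Substituting into the Lagrangian
E_p(U) - \frac{S(p)}{\beta}
  = E_p(U) - E_p(U) - \frac{1}{\beta} \ln N(\beta U)
  = -\frac{1}{\beta} \ln N(\beta U).
```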

Say our prior knowledge is $\{\rho_i\}$, the rationalities of the
players for their associated effective cost functions. Then
the Lagrangian for our prior knowledge is

$$\mathcal{L}(q) = \sum_i \lambda_i \left[ R([g_i]_{i,q_{(i)}}, q_i) - \rho_i \right] - S(q), \qquad (9)$$

where the $\lambda_i$ are the Lagrange parameters. Just as before,
there is an alternative way to motivate this Lagrangian:
if our prior knowledge consists of the entropy of
the joint system, and we assume each player will have
maximal rationality subject to that prior knowledge, we
are led to the Lagrangian of Eq. 9.

It is shown in the appendix that for the Kullback-Leibler
rationality operator, we can replace any constraint
of the form $R([g_i]_{i,q_{(i)}}, q_i) = \rho_i$ with

$$E_q(g_i) = \int dx\, g_i(x)\, \frac{e^{-\rho_i E(g_i \mid x_i)}}{N(\rho_i [g_i]_{i,q_{(i)}})}\, q_{(i)}(x_{(i)}).$$

In other words, knowing
that player $i$ has KL rationality $\rho_i$ is equivalent to knowing
that the actual expected value of $g_i$ equals the “ideal
expected value”, where $q_i$ is replaced by the Boltzmann
distribution of Eq. 3 with $\beta = \rho_i$. This contrasts with
the prior knowledge underlying the Lagrangian in Eq. 1,
in which we know the actual numerical value of $E_q(g_i)$.

Just as before, we can focus on player $i$ by augmenting
our prior knowledge to include the strategies of all the
other players. The associated Lagrangian is

$$\mathcal{L}_i(q_i) = \lambda_i \left[ R([g_i]_{i,q_{(i)}}, q_i) - \rho_i \right] - S(q_i). \qquad (10)$$

(The prior knowledge concerning the strategies of the
other players is manifested in the effective cost function.)
It is shown in the appendix that the set of all the Lagrangians
in Eq. 10 (one for each player) are minimized
simultaneously by any distribution of the form

$$q^g \equiv \prod_i \frac{e^{-\rho_i [g_i]_{i,q^g_{(i)}}}}{N(\rho_i [g_i]_{i,q^g_{(i)}})}. \qquad (11)$$

In addition, since this distribution obeys all the constraints
in the Lagrangian in Eq. 9, we know that there
exists a minimizer of that Lagrangian. All of this holds
regardless of the precise rationality operator one uses.

Note that the Lagrangian $\mathcal{L}_i$ of Eq. 10 for player $i$
arises in response to prior knowledge specific to player $i$.
Changing from one player and its Lagrangian to another
changes the prior knowledge. The same is true for the
Lagrangians in Eq. 3.

In contrast, the Lagrangian of Eq. 9 arises for a sin- 
gle unified body of prior knowledge, namely the set of 
all players’ rationalities. For that single body of knowl- 
edge, the equilibrium of the game is the solution to a 
single-objective optimization problem. This contrasts
with the conventional formulation of full rationality game
theory, where the equilibrium is cast as a solution to a
multi-objective optimization problem (one objective per
player). Furthermore, as usual, for finite $\beta$ at least one
of the derivatives of the Lagrangian is negative infinite
at the border of the allowed region of product distributions
(i.e., at the border of the Cartesian product of unit
simplices). Accordingly, all solutions lie in the interior
of that region. This can be a big advantage for finding
such solutions numerically, since it allows one to use local
descent algorithms.
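
As one illustration of a numerical search that stays in the interior, the sketch below repeatedly replaces each player's strategy with the Boltzmann distribution over its current effective costs. This fixed-point iteration is a stand-in chosen for brevity, not the descent schemes of the cited references, and it is not guaranteed to converge for arbitrary games or rationalities. The two-move cost tables, the starting point, and the function names are assumptions of this sketch.

```python
import math

def boltzmann(costs, beta):
    shift = min(costs)
    weights = [math.exp(-beta * (c - shift)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def iterate_boltzmann_responses(g1, g2, beta1, beta2, q1, q2, sweeps=200):
    """Simultaneously update both players' Boltzmann strategies.

    g1[a][b] and g2[a][b] are the costs to players 1 and 2 when player 1
    plays a and player 2 plays b.
    """
    for _ in range(sweeps):
        cost1 = [sum(g1[a][b] * q2[b] for b in range(len(q2)))
                 for a in range(len(q1))]
        cost2 = [sum(g2[a][b] * q1[a] for a in range(len(q1)))
                 for b in range(len(q2))]
        q1, q2 = boltzmann(cost1, beta1), boltzmann(cost2, beta2)
    return q1, q2

# Hypothetical matching-pennies style costs.
g1 = [[1.0, 0.0], [0.0, 1.0]]
g2 = [[0.0, 1.0], [1.0, 0.0]]

q1, q2 = iterate_boltzmann_responses(g1, g2, 1.0, 1.0, [0.9, 0.1], [0.2, 0.8])
print(q1, q2)
```

Every iterate has strictly positive probabilities, so the search never touches the border of the product of simplices. For this symmetric example the iteration settles on the uniform mixed strategies.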

K. Examples of bounded rational equilibria 

It can be difficult to write down a set of cost functions
and associated rationalities $\beta_i$ and then solve for the
associated bounded rational equilibrium. Starting with
expected costs rather than rationalities (so the $\beta_i$ are not
specified upfront but instead are Lagrange parameters
that we must solve for) can be even more tedious. However
there is a simple alternative way to construct examples
of games and their bounded rational equilibria. In
this alternative one starts with a particular mixed strategy
$q$ and then solves for a game for which $q$ is a bounded
rational equilibrium, rather than the other way around.

To illustrate this, consider a 2-person noncooperative
single-stage game. Let each player have 3 possible moves.
Indicate each player’s three possible moves by the numerals
0, 1, and 2. Say the (bounded rational) mixed
strategy equilibrium is

$$q_1(0) = 1/2, \quad q_1(1) = 1/4, \quad q_1(2) = 1/4;$$
$$q_2(0) = 2/3, \quad q_2(1) = 1/4, \quad q_2(2) = 1/12. \qquad (12)$$

Now we know that at the equilibrium, $q_1(x_1) \propto e^{-\beta_1 E(g_1 \mid x_1)}$,
where $\beta_1$ is player 1’s rationality, and $g_1$
is her cost function (the negative of her utility function).
This means for example that

$$\exp\left(-\beta_1 [E(g_1 \mid x_1 = 0) - E(g_1 \mid x_1 = 1)]\right) = \frac{q_1(0)}{q_1(1)} = 2;$$
$$\beta_1 [E(g_1 \mid x_1 = 0) - E(g_1 \mid x_1 = 1)] = -\ln(2). \qquad (13)$$

We have a similar equation for the remaining indepen- 
dent difference in expectation values for player 1. The 
analogous pair of equations for player 2 also hold. 

Now define the vectors $g_{i;j}(.) \equiv g_i(x_i = j, .)$. So for
example $g_{1;0} = (g_1(x_1 = 0, x_2 = 0), g_1(x_1 = 0, x_2 = 1), g_1(x_1 = 0, x_2 = 2))$.
Then we can express our equations
compactly as four dot product equalities:

$$\beta_1 (g_{1;0} - g_{1;1}) \cdot q_2 = -\ln(2), \quad \beta_1 (g_{1;0} - g_{1;2}) \cdot q_2 = -\ln(2);$$
$$\beta_2 (g_{2;0} - g_{2;1}) \cdot q_1 = -\ln(8/3), \quad \beta_2 (g_{2;0} - g_{2;2}) \cdot q_1 = -\ln(8). \qquad (14)$$

Note that we can absorb each $\beta_i$ into its associated $g_i$;
all that matters is their product. We can now plug in for
the vectors $q_1$ and $q_2$ from Eq. 12 and simply write down
solutions for the four three-dimensional vectors $g_{i;j}$. If
desired, we can then evaluate the associated expected
values of the cost functions for the two players.
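
One concrete way to carry this out is sketched below, with each $\beta_i$ absorbed into $g_i$. The rows $g_{i;0}$ are chosen arbitrarily (the numbers are made up), and every other row is the same vector shifted by a constant, which is only one of the many solutions the dot product equalities allow. The function names are assumptions of this sketch.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def boltzmann(costs):
    """Distribution proportional to exp(-cost); beta is absorbed into the costs."""
    shift = min(costs)
    weights = [math.exp(-(c - shift)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def rows_for_equilibrium(q_own, base_row):
    """Rows g_{i;j}, each a vector over the other player's moves.

    Row 0 is base_row (free to choose). Row j > 0 adds the constant
    ln(q_own[0] / q_own[j]), so that (g_{i;0} - g_{i;j}) . q_other equals
    -ln(q_own[0] / q_own[j]) for any normalized q_other.
    """
    rows = [list(base_row)]
    for j in range(1, len(q_own)):
        offset = math.log(q_own[0] / q_own[j])
        rows.append([b + offset for b in base_row])
    return rows

q1 = [1 / 2, 1 / 4, 1 / 4]
q2 = [2 / 3, 1 / 4, 1 / 12]

g1_rows = rows_for_equilibrium(q1, [0.3, -1.0, 2.0])  # g_1(x_1 = j, .)
g2_rows = rows_for_equilibrium(q2, [1.0, 0.0, 0.5])   # g_2(x_2 = j, .)

# Check: the Boltzmann distributions over effective costs recover q1 and q2.
recovered_q1 = boltzmann([dot(row, q2) for row in g1_rows])
recovered_q2 = boltzmann([dot(row, q1) for row in g2_rows])
print(recovered_q1, recovered_q2)
```

The four differences of effective costs reproduce the right-hand sides $-\ln(2)$, $-\ln(2)$, $-\ln(8/3)$ and $-\ln(8)$, and the recovered distributions match Eq. 12.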

Note that the variables in the first pair of equalities in
Eq. 14 are independent of those in the second pair. In
other words, whereas the Boltzmann equations giving $q$
for a specified set of $g_i$ are a set of coupled equations, the
equations giving the $g_i$ for a specified $q$ are not coupled.

Note also that our equations for the $g_{i;j}$ are (extremely)
underconstrained. This illustrates how compressive the
mapping from the $g_i$ to the associated equilibrium $q$ is.
Bear in mind though that this mapping is also multi-valued
in general; a single set of cost functions
can have more than one equilibrium, just like it can have
more than one Nash equilibrium.

The generalization of this example to arbitrary numbers
of players with arbitrary move spaces is immediate.
As before, indicate the moves of every player by an associated
set of integer numerals starting at 0. Let the
subscript $(i)$ on a vector indicate all components but the
$i$’th one. Also absorb the rationalities $\beta_i$ into the associated
$g_i$.

Now specify $q$ and the vectors $g_i(x_i = 0, .)$ (one vector
for each $i$) to be anything whatsoever. Then for
all players $i$, the only associated constraint on the $i$’th
cost function concerns certain projections of the vectors
$g_i(x_i > 0, .)$ (one projection for each value $x_i$):

$$\int dx_{(i)}\, g_i(x_i, x_{(i)}) \prod_{j \neq i} q_j(x_j) = -\ln\left(\frac{q_i(x_i)}{q_i(0)}\right) + \int dx_{(i)}\, g_i(0, x_{(i)}) \prod_{j \neq i} q_j(x_j),$$

i.e.,

$$g_i(x_i, .) \cdot q_{(i)} = g_i(0, .) \cdot q_{(i)} - \ln\left(\frac{q_i(x_i)}{q_i(0)}\right).$$


L. Semi-coordinate systems

Consider a multi-stage game like chess, with the stages
(i.e., the instants at which one of the players makes a
move) delineated by $t$. Now strategies are what are set
by the players before play starts. So in such a multi-stage
game the strategy of player $i$, $x_i$, must be the set of
$t$-indexed maps taking what that player has observed in
the stages $t' < t$ into its move at stage $t$. Formally, this
set of maps is called player $i$’s normal form strategy.

The joint strategy of the two players in chess sets their
joint move-sequence, though in general the reverse need
not be true. In addition, one can always find a joint
strategy to result in any particular joint move-sequence.
Typically there is overlap in what the players in chess
have observed at stages preceding the current one. This
means that even if the players’ strategies are statistically
independent, their move sequences are statistically coupled.
In such a situation, by parameterizing the space of
joint move-sequences $z$ with joint-strategies $x$, we shift
our focus from the coupled distribution $P(z)$ to the
decoupled product distribution, $q(x)$. This is the advantage
of casting multi-stage games in terms of normal form
strategies.

More generally, any onto mapping $\zeta : x \to z$, not necessarily
invertible, is called a semi-coordinate system.
The identity mapping $z \to z$ is a trivial example of a
semi-coordinate system. Another example is the mapping
from joint-strategies in a multi-stage game to joint
move-sequences. So changing the representation space of a
multi-stage game from move-sequences $z$ to strategies $x$ is a
semi-coordinate transformation of that game.

We can perform a semi-coordinate transformation even
in a single-stage game. Say we restrict attention to
distributions over spaces of possible $x$ that are product
distributions. Then changing $\zeta(.)$ from the identity map
to some other function means that the players are no
longer independent. After the transformation their strategy
choices (the components of $z$) are statistically
coupled, even though we are considering a product
distribution.

Formally, this is expressed via the standard rule for
transforming probabilities,

$$P_z(z) \equiv \zeta(P_x) \equiv \int dx\, P_x(x)\, \delta(z - \zeta(x)), \qquad (17)$$

where $\zeta(.)$ is the mapping from $x$ to $z$, and $P_x$ and $P_z$ are
the distributions across $x$-space and $z$-space, respectively.
To see what this rule means geometrically, let $\mathcal{P}$ be the
space of all distributions (product or otherwise) over $z$’s.
Recall that $\mathcal{Q}$ is the space of all product distributions
over $x$, and let $\zeta(\mathcal{Q})$ be the image of $\mathcal{Q}$ in $\mathcal{P}$. Then by
changing $\zeta(.)$, we change that image; different choices of
$\zeta(.)$ will result in different manifolds $\zeta(\mathcal{Q})$.

For example, say we have two players, with two possible
strategies each. So $z$ consists of the possible joint
strategies, labeled (1,1), (1,2), (2,1) and (2,2). Have the
space of possible $x$ equal the space of possible $z$, and
choose $\zeta(1,1) = (1,1)$, $\zeta(1,2) = (2,2)$, $\zeta(2,1) = (2,1)$,
and $\zeta(2,2) = (1,2)$. Say that $q$ is given by
$q_1(x_1 = 1) = q_2(x_2 = 1) = 2/3$. Then the distribution over
joint-strategies $z$ is $P_z(1,1) = P_x(1,1) = 4/9$, $P_z(2,1) = P_z(2,2) = 2/9$,
$P_z(1,2) = 1/9$. So $P_z(z) \neq P_z(z_1) P_z(z_2)$;
the strategies of the players are statistically coupled.
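
The numbers in this example can be reproduced by pushing the product distribution through the mapping, using exact rational arithmetic. The dictionary-based representation and variable names below are choices made for this sketch.

```python
from fractions import Fraction
from itertools import product

# The mapping from joint strategies x to joint strategies z used in the example.
zeta = {(1, 1): (1, 1), (1, 2): (2, 2), (2, 1): (2, 1), (2, 2): (1, 2)}

# Product distribution over x with q_1(1) = q_2(1) = 2/3.
q1 = {1: Fraction(2, 3), 2: Fraction(1, 3)}
q2 = {1: Fraction(2, 3), 2: Fraction(1, 3)}

# Discrete form of Eq. 17: P_z(z) is the total probability of all x mapped to z.
p_z = {}
for x in product((1, 2), (1, 2)):
    z = zeta[x]
    p_z[z] = p_z.get(z, Fraction(0)) + q1[x[0]] * q2[x[1]]

# Marginals of P_z over each player's component.
marginal_1 = {a: sum(p for z, p in p_z.items() if z[0] == a) for a in (1, 2)}
marginal_2 = {b: sum(p for z, p in p_z.items() if z[1] == b) for b in (1, 2)}

coupled = any(p_z[z] != marginal_1[z[0]] * marginal_2[z[1]] for z in p_z)
print(p_z, coupled)
```

The joint distribution over $z$ differs from the product of its marginals, confirming that the two components of $z$ are statistically coupled even though $q$ is a product distribution.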

Such coupling of the players’ strategies can be viewed
as a manifestation of sets of potential binding contracts.
To illustrate this return to our two player example. Each
possible value of a component $x_i$ determines a pair of
possible joint strategies. For example, setting $x_1 = 1$
means the possible joint strategies are (1,1) and (2,2).
Accordingly such a value of $x_i$ can be viewed as a set
of proffered binding contracts. The value of the other
components of $x$ determines which contract is accepted;
it is the intersection of the proffered contracts offered
by all the components of $x$ that determines what single
contract is selected. Continuing with our example, given
that $x_1 = 1$, whether the joint-strategy is (1,1) or (2,2)
(the two options offered by $x_1$) is determined by the value
of $x_2$.

Binding contracts are a central component of coopera- 
tive game theory. In this sense, semi-coordinate transfor- 
mations can be viewed as a way to convert noncoopera- 
tive game theory into a form of cooperative game theory. 

While the distribution over x uniquely sets the distri- 
bution over z, the reverse is not true. However so long as 
our Lagrangian directly concerns the distribution over x 
rather than the distribution over z, by minimizing that 
Lagrangian we set a distribution over z. In this way 
we can minimize a Lagrangian involving product distri- 
butions, even though the associated distribution in the 
ultimate space of interest is not a product distribution. 

The Lagrangian we choose over x should depend on our 
prior information, as usual. If we want that Lagrangian 
to include an expected value over z’s (e.g., of a cost func- 
tion), we can directly incorporate that expectation value 
into the Lagrangian over $x$’s, since expected values in $x$
and $z$ are identical: $\int dz\, P_z(z) A(z) = \int dx\, P_x(x) A(\zeta(x))$
for any function $A(z)$. (Indeed, this is the standard
justification of the rule for transforming probabilities, Eq. 17.)

However other functionals of probability distributions
can differ between the two spaces. This is especially
common when $\zeta(.)$ is not invertible, so the space of possible
$x$ is larger than the space of possible $z$. For example,
in general the entropy of a $q \in \mathcal{Q}$ will differ from that
of its image, $\zeta(q) \in \zeta(\mathcal{Q})$, in such a case. (The prior
... invariance when the two spaces have the same cardinality.)
A correction factor is necessary to relate the two
entropies [46].
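
A minimal example of the entropies differing under a non-invertible mapping is sketched below. The mapping (exclusive-or of two binary moves) and the uniform product distribution are assumptions chosen for illustration; the sketch does not attempt to reproduce the correction factor of [46].

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy of a distribution stored as {outcome: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# A product distribution over two binary moves.
q1 = {0: 0.5, 1: 0.5}
q2 = {0: 0.5, 1: 0.5}
p_x = {(a, b): q1[a] * q2[b] for a, b in product(q1, q2)}

def zeta(x):
    """Hypothetical onto, non-invertible map from x = (x1, x2) to a single bit."""
    return x[0] ^ x[1]

# Image distribution over z.
p_z = {}
for x, p in p_x.items():
    p_z[zeta(x)] = p_z.get(zeta(x), 0.0) + p

entropy_x = entropy(p_x)
entropy_z = entropy(p_z)
print(entropy_x, entropy_z)
```

Here two points of $x$-space map to each $z$, and the entropy over $x$ exceeds the entropy over $z$.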

In such cases, we have to be careful about which space
we use to formulate our Lagrangian. If we use the
transformation $\zeta(.)$ as a tool to allow us to analyze bargaining
games with binding contracts, then the direct space of
interest is actually the $x$’s (that is the place in which the
players make their bargaining moves). In such cases it
makes sense to apply all the analysis of the preceding
sections exactly as it is written, concerning Lagrangians
and distributions over $x$ rather than $z$ (so long as we
redefine cost functions to implicitly pre-apply the mapping
$\zeta(.)$ to their arguments). However if we instead use $\zeta(.)$
simply as a way of establishing statistical dependencies
among the strategies of the players, it may make sense
to include the entropy correction factor in our $x$-space
Lagrangian.

An important special case is where the following three
conditions are met: Each point $z$ is the image under
$\zeta(.)$ of the same number of points in $x$-space, $n$; $\mu(x)$
is uniform (and therefore so is $\mu(z)$); and the Lagrangian
in $x$-space, $\mathcal{L}_x$, is a sum of expected costs and the
entropy. In this situation, consider a $z$-space Lagrangian,
$\mathcal{L}_z$, whose functional dependence on $P_z$, the distribution
over $z$’s, is identical to the dependence of $\mathcal{L}_x$ on $P_x$,
except that the entropy term is divided by $n$ [59]. Now
the minimizer $P^*(x)$ of $\mathcal{L}_x$ is a Boltzmann distribution
in values of the cost function(s). Accordingly, for any
$z$, $P^*(x)$ is uniform across all $n$ points $x \in \zeta^{-1}(z)$ (all
such $x$ have the same cost value(s)). This in turn means
that $S(\zeta(P_x)) = nS(P_z)$. So our two Lagrangians give the
same solution, i.e., the “correction factor” for the entropy
term is just multiplication by $n$.

M. Entropic prior game theory 

Finally, it is worth noting that in the real world the
information we are provided concerning the system often
will not consist of exact values of functionals of $q$, be those
values expected costs, rationalities, or what have you.
Rather that knowledge will be in the form of data, $D$,
together with an associated likelihood function over the
space of $q$. For example, that knowledge might consist of
a bias toward particular rationality values, rather than
precisely specified values:

$$P(D \mid q) \propto \ldots$$

where $\alpha$ sets the strength of the bias.

The extension of the minimum information principle to
such situations uses the entropic prior, $P(q) \propto e^{-\gamma S(q)}$.
Bayes’ theorem is then invoked to get the posterior
distribution [18]:

$$P(q \mid D) \propto \ldots$$


The Bayes optimal estimate for $q$, under a quadratic
penalty term, is then given by $E(q \mid D)$. The maxent
principle for estimating $q$ is given by this estimate under
the limit of all $\alpha_i$ going to infinity. For finite $\alpha$, solving
for $E(q \mid D)$ can be quite complicated though. For
simplicity, such cases are not considered here.

III. PD THEORY AND STATISTICAL PHYSICS 

There are many connections between bounded ratio- 
nal game theory — PD theory — and statistical physics. 
This should not be too surprising, given that many of the 
important concepts in bounded rational game theory, like 
the Boltzmann distribution, the partition function, and 
free energy, were first explored in statistical physics. This 
section discusses some of these connections. 


A. Background on statistical physics 

Statistical physics is the physics of systems about 
which we have incomplete information. An example is 
knowing only the expected value of a system’s energy 
(i.e., its temperature) rather than the precise value of the 
energy. The statistical physics of such systems is known 
as the canonical ensemble. Another example is the 
grand canonical ensemble (GCE). There the number 
of particles of various types in the system is also uncer- 
tain. As in the canonical ensemble, in the GCE what 
knowledge we do have takes the form of expectation val- 
ues of the quantities about which we are uncertain, i.e., 
the number of particles of the various types that the system
contains, and the energy of the system.

Traditionally these kinds of ensembles were analyzed 
in terms of “baths” of the uncertain variable that are 
connected to the system. For example, in the canonical 
ensemble the system is connected to a heat bath. In the 
GCE the system is also connected to a bath of particles 
of the various types. 

Such analysis showed that for the canonical ensemble
the probability of the system being in the particular
state $x$ is given by the Boltzmann distribution over the
associated value of the system’s energy, $G(x)$, with $\beta$
interpreted as the (inverse) temperature of the system:
$p(x) \propto e^{-\beta G(x)}$. This result is independent of the detailed
characteristics of the physical system; all that is important
is the Hamiltonian $G(x)$, and the (inverse) temperature $\beta$.

Note that once one knows $p(x)$ and $G(x)$, one knows
the expected energy of the system. It is $G(x)$ that is a
fixed property of the system, whereas $\beta$ can vary. Accordingly,
specifying $\beta$ is exactly equivalent to specifying
the expected energy of the system.

In the case of the GCE, $x$ implicitly specifies the number
of particles of the various types, as well as their
precise state. The analysis for that case showed that
$p(x) \propto e^{-\beta [G(x) - \sum_i \mu_i n_i]}$. In this formula $\beta$ is again the
inverse temperature, $n_i$ is the number of particles of type
$i$, and $\mu_i > 0$ is the chemical potential of each particle
of type $i$.

Jaynes was the first to show that these results of con- 
ventional statistical physics could be derived without re- 
course to artificial notions like “baths” , simply by using 
the maxent principle. In particular, he used the exact 
reasoning in Sec. II F to derive the fact that the canoni- 
cal ensemble is governed by the Boltzmann distribution. 


B. Mean field theory and PD theory 


In practice it can be quite difficult to evaluate this
Boltzmann distribution, due to difficulty in evaluating
the partition function. For example, in a spin glass,
$x$ is an $N$-dimensional vector of bits, one per particle,
and $G(x) = \sum_{i,j} H_{ij} x_i x_j$. So the partition function
is given by $\int dx\, e^{-\beta \sum_{i,j} H_{ij} x_i x_j}$, where $H$ is a symmetric
real-valued matrix, and as before we use $\int$ to indicate
the integral according to the appropriate measure (here
a point-sum measure). In general, evaluating this sum
for large numbers of spins cannot be done in closed form.

Mean Field (MF) theory is a technique for getting
around this problem by approximating the partition
function. Intuitively, it works by treating all the particles
as independent. It does this by replacing some of the
values of the state of a particle in the Hamiltonian by its
average state. For example, in the case of the spin glass,
one approximates $\sum_{i,j} H_{ij} [x_i - E(x_i)][x_j - E(x_j)] \approx 0$,
where the expectation values are evaluated according to
the associated exact Boltzmann distribution, i.e., one assumes
that fluctuations about the means are relatively
negligible. This then means that

$$G(x) \approx \sum_{i,j} H_{ij}\, 2 x_i E(x_j) - \sum_{i,j} H_{ij} E(x_i) E(x_j).$$


The second sum in this approximation cancels out when 
we evaluate the associated approximate Boltzmann dis- 
tribution, leaving us with the distribution 


pP u (x) ~ P (3u (x) ~ 


f dx Hi >i 2 x * e { x j) 



e aiXi 

f dxi e~ otiXi ’ 


where 


oti = 2(3^2 HjjE(xj). 

j 

This approximation $p^{\beta U}_{mf}(x)$ is far easier to work with than the exact Boltzmann distribution, $p^{\beta U}(x) = \frac{e^{-\beta G(x)}}{N(\beta U)}$, since each term in the product is for a single spin by itself. In particular, if we adopt this approximation we can use numerical techniques to solve the associated set of simultaneous equations

$$E(x_i) = -\frac{\partial}{\partial \alpha_i} \ln\left[\int dx_i\, e^{-\alpha_i x_i}\right] \quad \forall i$$

for the $E(x_i)$ (so that those $E(x_i)$ are no longer exactly equal to the expected values of the $\{x_i\}$ under the distribution $p^{\beta U}(x)$). Given those $E(x_i)$ values, we can then evaluate the associated approximate Boltzmann distribution explicitly.
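As a minimal sketch of how such simultaneous equations might be solved numerically, the snippet below assumes each $x_i$ is a bit in {0, 1} under the point-sum measure, so that the single-spin average reduces to $E(x_i) = 1/(1 + e^{\alpha_i})$ with $\alpha_i = 2\beta \sum_j H_{ij} E(x_j)$. The damped fixed-point iteration, the function names, and the small coupling matrix `H` are illustrative assumptions rather than anything specified in the text.

```python
import math

def mean_field_bits(H, beta, iters=500, damping=0.5):
    """Damped fixed-point iteration for the mean-field means E(x_i), x_i in {0, 1}."""
    n = len(H)
    m = [0.5] * n  # initial guess for each E(x_i)
    for _ in range(iters):
        # alpha_i = 2 * beta * sum_j H_ij E(x_j)
        alpha = [2.0 * beta * sum(H[i][j] * m[j] for j in range(n)) for i in range(n)]
        # For a single bit: sum_x x e^{-alpha x} / sum_x e^{-alpha x} = 1 / (1 + e^{alpha})
        target = [1.0 / (1.0 + math.exp(a)) for a in alpha]
        m = [(1.0 - damping) * mi + damping * ti for mi, ti in zip(m, target)]
    return m

def self_consistency_gap(H, beta, m):
    """Largest violation of E(x_i) = 1 / (1 + e^{alpha_i}) over all i."""
    n = len(H)
    gap = 0.0
    for i in range(n):
        alpha_i = 2.0 * beta * sum(H[i][j] * m[j] for j in range(n))
        gap = max(gap, abs(m[i] - 1.0 / (1.0 + math.exp(alpha_i))))
    return gap

# Hypothetical symmetric couplings for three spins.
H = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]
beta = 0.5
m = mean_field_bits(H, beta)
print(m, self_consistency_gap(H, beta, m))
```

For weak couplings like these the iteration is a contraction and settles quickly; for stronger couplings several self-consistent solutions can exist and the result may depend on the starting guess.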

The mean field approximation to the Boltzmann distribution is a product distribution, and in fact is identical to the product distribution $q^g$ of bounded rational game theory, for the team game where $g_i(x) = 2\beta G(x)\ \forall i$. Accordingly, the “mean field theory” approximation for an arbitrary Hamiltonian U can be taken to be the associated team game $q^g$, which is defined for any U.

This bridge between bounded rational game theory and 
statistical physics means that many of the powerful tools 
that have been developed in statistical physics can be 
applied to bounded rational game theory. In particu- 
lar, much work in statistical physics has been done with 
approximating distributions that are higher order than 
products, allowing for coupling between the variables. 
The associated extension of PD theory is a full-blown 
theory of Probability Lagrangians. 

Finally, this bridge can be used to apply PD theoretic 
techniques in statistical physics rather than vice-versa. 
In particular, it is shown elsewhere [20, 21] that if one re- 
places the identical cost function of each player in a team 
game with different cost functions, then the bounded ra- 
tional equilibrium of that game can be numerically found 
far more quickly. In the context of statistical physics, this 
means that numerically solving for a MF approximation 
may be expedited by assigning a different Hamiltonian 
to each particle. 


C. Information-theoretic misfit measures 

The proper way to approximate a target distribution p 
with a distribution from a set C is to first specify a misfit 
measure saying how well each member of C approximates 
p, and then solve for the member with the smallest mis- 
fit. This is just as true when C is the set of all product 
distributions as when it is any other set. 

How best to measure distances between probability distributions is a topic of ongoing controversy and research [27]. The most common way to do so is with the infinite limit log likelihood of data being generated by one distribution but misattributed to have come from the other. This is known as the Kullback-Leibler distance [16, 17, 28]:

$$KL(p_1 \,\|\, p_2) = S(p_1 \,\|\, p_2) - S(p_1) \qquad (18)$$

where $S(p_1 \,\|\, p_2) = -\int dx\, p_1(x) \ln\left[\frac{p_2(x)}{\mu(x)}\right]$ is known as the cross entropy from $p_1$ to $p_2$ (and as usual we implicitly choose uniform $\mu$). The KL distance is always non-negative, and equals zero iff its two arguments are identical. However it is far from being a metric. In addition to violating the triangle inequality, it is not symmetric under interchange of its arguments, and in numerical applications has a tendency to blow up. (That happens whenever the support of $p_1$ includes points outside the support of $p_2$.)
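The asymmetry and the blow-up described above can be seen in a small discrete example. This is only an illustrative sketch: the helper `kl` and the two distributions are hypothetical, and the uniform measure is taken to be the counting measure on three points.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * ln(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # p puts mass on a point outside the support of q
        total += pi * math.log(pi / qi)
    return total

p1 = [0.5, 0.4, 0.1]
p2 = [0.6, 0.4, 0.0]

print(kl(p1, p1))  # zero for identical arguments
print(kl(p2, p1))  # finite: the support of p2 lies inside the support of p1
print(kl(p1, p2))  # infinite: p1 has mass where p2 has none
```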

Nonetheless, this is by far the most popular measure. It is illuminating to use it as our misfit measure. As shorthand, define the “pq distance” as $KL(p \,\|\, q)$, and the “qp distance” as $KL(q \,\|\, p)$, where p is our target distribution and q is a product distribution. Then it is straightforward to show that the qp distance from q to target distribution $p^{\beta U}$ is just the maxent Lagrangian, up to irrelevant overall constants. In other words, the q minimizing the maxent Lagrangian (the distribution arising in MF theory) is the q with the minimal qp distance to the associated Boltzmann distribution. [60]

However the qp distance is the (infinite limit of the negative log of the) likelihood that distribution p would attribute to data generated by distribution q. It can be argued that a better measure of how well q approximates p would be based on the likelihood that q attributes to data generated by p. This is the pq distance. Up to an overall additive constant (of the canonical distribution’s entropy), the pq distance is

$$KL(p \,\|\, q) = -\sum_i \int dx\, p(x) \ln[q_i(x_i)].$$

This is equivalent to a team game where each coordinate i has the “Lagrangian”

$$\mathscr{L}_i(q) = -\int dx_i\, p_i(x_i) \ln[q_i(x_i)],$$

where $p_i(x_i)$ is the marginal distribution $\int dx_{(i)}\, p(x)$.

The minimizer of this is just $q_i = p_i\ \forall i$, i.e., each $q_i$ is set to the associated marginal distribution of p. So in particular, when our target distribution is the canonical ensemble distribution $p^{\beta U}$, the optimal q according to pq distance is the set of marginals of $p^{\beta U}$. Note that unlike the solution for qp distance, here the solution for each $q_i$ is independent of the $q_{(i)}$. So we don’t have a game theory scenario; we do not need to pay attention to the $q_{(i)}$ when estimating each separate $q_i$. Correspondingly, whereas there are many local minima of the team game Lagrangian studied above, $q \in \mathcal{Q} \rightarrow KL(q \,\|\, p^{\beta U})$, there is only one, global minimum of $q \in \mathcal{Q} \rightarrow KL(p^{\beta U} \,\|\, q)$.

Another difference between the two kinds of KL distance is how the associated optimal product distributions are typically calculated numerically. The product distribution that optimizes the maxent Lagrangian is usually found via derivative-based traversal of that Lagrangian, or techniques like (mixed) Brouwer updating [20-22, 24, 42]. In contrast, the integral giving each marginal distribution of p is usually found via adaptive importance sampling of the associated integral, with the proposal distribution for the integral to approximate $p_i$ set adaptively, as $q_{(i)}$ [20].

It is possible to motivate yet other choices for the q that best approximates $p^{\beta U}$. To derive one of them, start with Lemma 1, with $R^n$ set to the space of real-valued functions over the set of x’s (so that n is the number of possible x). Have a single constraint f that restricts us to $\mathcal{P}$, the unit simplex in $R^n$, i.e., that restricts us to the set of functions that (assuming they are nowhere-negative) are probability distributions. Choose V to be the associated Lagrangian, $\mathscr{L}(p) = \beta E_p(G) - S(p)$, p being a point in our constrained submanifold of $R^n$. Note that this p can be any distribution over the x’s, including one that couples the components $\{x_i\}$.

Say we are at some current product distribution q. Then we can apply Lemma 1 with the choices just outlined to tell us what direction to move from q in $\mathcal{P}$ so as to reduce the Lagrangian. In general, taking a step in that direction will result in a distribution p' that is not a product distribution. However we can solve for the product distribution that is closest to that p', and move to that product distribution. By iterating this procedure we can define a search over the submanifold of product distributions. We can then solve for the product distribution at which this search will terminate.

To do this, of course, we must define what we mean by “closest”. Say that we choose to measure closeness by pq distance. Then the terminating product distribution is the one for which the marginals of $\nabla \mathscr{L} + \lambda \nabla f$ all equal 0. For each i, this means that

$$\int dx_{(i)}\, [\beta G(x) + \ln(p(x)) + 1 + \lambda] = 0$$

at the equilibrium product distribution p. Writing out $p = \prod_i q_i$ and evaluating gives

$$q_i(x_i) \propto \exp\left(-\beta\, \frac{\int dx_{(i)}\, G(x)}{\int dx_{(i)}\, 1}\right) \qquad (19)$$

This is akin to the $q^g$ of a bounded rational game, except that each player/particle i sets its distribution by evaluating conditional expected U with a uniform distribution over the $x_{(i)}$ rather than with $q_{(i)}$.
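The claim in this subsection that the pq-optimal product distribution is the set of marginals of the target can be checked numerically on a toy case. The sketch below is hedged accordingly: the two-bit target `p`, the helper names, and the random comparison set are all assumptions chosen for illustration.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for discrete distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def product(q0, q1):
    """Joint distribution of two independent bits, with q0, q1 = P(bit = 1).

    States are indexed as 2 * x0 + x1."""
    return [(1 - q0) * (1 - q1), (1 - q0) * q1, q0 * (1 - q1), q0 * q1]

# Hypothetical coupled target over two bits.
p = [0.4, 0.1, 0.1, 0.4]

# Marginals of p: P(x0 = 1) and P(x1 = 1).
m0 = p[2] + p[3]
m1 = p[1] + p[3]
best = kl(p, product(m0, m1))

# Compare against many other product distributions.
random.seed(0)
others = [kl(p, product(random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)))
          for _ in range(1000)]
print(best, min(others))
```

No randomly drawn product distribution should beat the product of marginals, in line with the statement that this pq problem has a single global minimum.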

D. Semi-coordinate transformations 

Let’s say there are numerical difficulties with our finding a q that is a local minimizer of the maxent Lagrangian. That q might still be a poor fit to p(x) if it is far from the global minimizer of the Lagrangian. Furthermore, even the global minimizer might be a poor fit, if p(x) simply can’t be well-approximated by a product distribution.

There are many techniques for improving the fit of a product distribution to a target distribution in machine learning and statistics [28]. To give a simple example, say one wishes to approximate the target distribution in $R^n$ with a product of Gaussians, one Gaussian for each coordinate. Even if the target distribution is a Gaussian, if it is askew, then one won’t be able to do a good job of approximating it with a product of Gaussians. However one can use Principal Components Analysis (PCA) to find how to rotate one’s coordinates so that a product of Gaussians fits the target exactly.
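The PCA example can be made concrete in two dimensions. The sketch below assumes a zero-mean Gaussian target with a hypothetical covariance matrix `cov`, and uses the standard closed form for the KL distance from a Gaussian to the product of its own marginals, $\frac{1}{2}\ln(C_{11}C_{22}/\det C)$, as the misfit. The rotation angle is the usual one that diagonalizes a symmetric 2x2 matrix; all names are illustrative.

```python
import math

def rotate_cov(c, theta):
    """Covariance R^T C R of the coordinates y = R^T x, for a rotation R by theta."""
    a, b, d = c[0][0], c[0][1], c[1][1]
    cs, sn = math.cos(theta), math.sin(theta)
    a2 = a * cs * cs + 2 * b * cs * sn + d * sn * sn
    d2 = a * sn * sn - 2 * b * cs * sn + d * cs * cs
    b2 = (d - a) * cs * sn + b * (cs * cs - sn * sn)
    return [[a2, b2], [b2, d2]]

def misfit(c):
    """KL distance from a zero-mean Gaussian with covariance c to the product of its marginals."""
    det = c[0][0] * c[1][1] - c[0][1] ** 2
    return 0.5 * math.log(c[0][0] * c[1][1] / det)

# Hypothetical askew Gaussian target.
cov = [[2.0, 1.2], [1.2, 1.0]]

# Principal-axis angle: makes the off-diagonal entry vanish.
theta = 0.5 * math.atan2(2 * cov[0][1], cov[0][0] - cov[1][1])
rotated = rotate_cov(cov, theta)

print(misfit(cov))      # positive: a product of Gaussians cannot fit the askew target
print(misfit(rotated))  # about zero: in rotated coordinates the fit is exact
```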

Similar techniques can address both the issue of breaking free of local minima of the Lagrangian, and improving the accuracy of the best product distribution approximation to p. More precisely, identify x with the variables z discussed in Sec. II L. Then consider changing the map $\zeta(.) : x \rightarrow z$ from the identity map. This will in general change the mapping from $P_x$ to $\mathscr{L}_z(\zeta(P_x))$. So if $\mathscr{L}_z$ is the Lagrangian we are interested in, the mapping from product distributions over x to values of that Lagrangian can be changed by changing $\zeta(.)$, in general.

As an example, consider the case where the space of x’s is identical to the space of z’s, and consider all possible bijective transformations $\zeta(.)$. Entropy is the same in both spaces for any $\zeta$, i.e., $S(P_z) = S(\zeta(P_x)) = S(P_x)$. So for fixed $P_x$, the entropy in z-space is independent of $\zeta(.)$. However if we fix $P_x$ and change $\zeta(.)$ the expected values of utilities will change. So $\mathscr{L}_z(\zeta(P_x))$ does depend on $\zeta(.)$, as claimed.
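A small discrete illustration of this example follows. It is a sketch under stated assumptions: four states, a hypothetical utility `G` defined on z, and two bijections written as lookup tables (`identity` and `swap`).

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def push_forward(p_x, zeta):
    """Distribution over z induced by z = zeta[x] when x ~ p_x (zeta a bijection)."""
    p_z = [0.0] * len(p_x)
    for x, px in enumerate(p_x):
        p_z[zeta[x]] += px
    return p_z

def expected(p, g):
    """Expected value of g under p."""
    return sum(pi * gi for pi, gi in zip(p, g))

p_x = [0.1, 0.2, 0.3, 0.4]   # fixed distribution over x
G = [0.0, 1.0, 2.0, 3.0]     # utility defined on z
identity = [0, 1, 2, 3]
swap = [3, 2, 1, 0]          # a different bijection from x to z

for zeta in (identity, swap):
    p_z = push_forward(p_x, zeta)
    print(entropy(p_z), expected(p_z, G))
```

The entropy is the same for both maps, while the expected utility differs, so a Lagrangian combining the two depends on the choice of map.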

This means that by changing $\zeta(.)$ while leaving $q_x$ unchanged, we will in general change whether we are at a local minimum of $\mathscr{L}_z(\zeta(q_x))$. Furthermore, such a change will change how closely the global minimizer of $\mathscr{L}_z(\zeta(q_x))$ approximates any particular target distribution. Indeed, some such transformation will always transform a team game to have a strictly convex maxent Lagrangian, with only one (bounded rational) equilibrium, an equilibrium that is in the interior of the region of allowed q and that has the lowest possible value of the Lagrangian. In the worst case, we can get this behavior by transforming to the semi-coordinate system in which x is one-dimensional, so that any p(z), coupling its variables or not, can be expressed as a $q(x) = q_1(x_1)$.

Note that unlike with PCA, semi-coordinate transformations can be used for non-Euclidean semi-coordinates (i.e., when neither x’s nor z’s are Euclidean vectors). They also can be guided by numerous measures of the goodness of fit to the target distribution (e.g., KL distance), in contrast to PCA’s restriction to assuming a Gaussian likelihood.


E. Bounded rational game theory for variable 
number of players 

The bridge between statistical physics and bounded rational game theory has many uses beyond the practical ones alluded to in the previous subsection. In particular, it suggests extending bounded rational game theory to ensembles other than the canonical ensemble. As an example, in the GCE the number of particles of the various allowed types is uncertain and can vary. The bounded rational game theory version of that ensemble is a game in which the number of players of various types can vary.

We can illustrate this by extending a simple instance of evolutionary game theory [6] to incorporate bounded rationality and allow for a finite total number of players. Say we have a finite population of players, each of which has one of m' possible types. (These are sometimes called feature vectors in the literature.) Each player i in the population is randomly paired with a different player j, and they each choose a strategy for a two-person game. The set of strategies each of those players can choose among is fixed by its respective attribute vector. In addition the cost player i receives depends on the attribute vectors of itself and of j, in addition to their joint strategy. Finally, to reflect this dependence, we allow each player to vary its strategy depending on the attribute vector of its opponent; we call player i’s meta-strategy the mapping from its opponent’s attribute vector to i’s strategy [61].

We encode an instance of this scenario in an x with a countably infinite number of dimensions. $x_{i,0} = n_i(x)$ specifies the number of players of type i, with n(x) being the vector of the number of players of all types. For $1 \le j \le x_{i,0}$, $x_{i,j}$ is the meta-strategy selected by the j’th player of type i. If its opponent is the j’th player of type T', the cost to the i’th player of type T is $g_{T,i,T',j}(x) = g_{T,i,T',j}(s, s', n_T, n_{T'})$, where s and s' are the two players’ respective meta-strategies. To enforce consistency between the index numbers i, j and the associated numbers of players, we set $g_{T,i,T',j}(s, s', n) = 0$ if either $i > n_T$ or $j > n_{T'}$.

To start we parallel the GCE, and presume that for each type we know the expected number of players having that type, and the expected cost averaged over all players having that type. Also stipulate that the distribution over x is a product distribution, q. Then our prior information specifies the values of

$$E_{q_{T,0}}(n_T) = \sum_{x_{T,0}} x_{T,0}\, q_{T,0}(x_{T,0})$$

and

$$\sum_{x_{1,0}} \cdots \sum_{x_{T,0} > 0} \cdots \sum_{x_{m',0}} \left[\prod_{i=1}^{m'} q_{i,0}(x_{i,0})\right] \frac{\sum_{T',j,k} \int dx_{T,j}\, dx_{T',k}\; q_{T,j}(x_{T,j})\, q_{T',k}(x_{T',k})\, g_{T,j,T',k}(x)\, [1 - \delta_{T,T'}\delta_{j,k}]}{x_{T,0} \sum_{T''} x_{T'',0}}$$

respectively, for all types T. (The sums over j and k all implicitly extend from 1 to $\infty$, and the delta functions are Kronecker deltas that prevent a player from playing itself.)

We can write these expressions as expectation values, over x, of 2m' functions. These functions are the m' functions $n_T(x) = x_{T,0}$ (one function for each T) and the m' functions

$$c_T(x) \equiv \frac{\sum_{T',j,k} g_{T,j,T',k}(x)\, [1 - \delta_{T,T'}\delta_{j,k}]}{x_{T,0} \sum_{T''} x_{T'',0}}\; \Theta(x_{T,0})$$

respectively, where $\Theta$ is the Heaviside theta function that equals 1 if its argument exceeds 0, and equals 0 otherwise. Accordingly, the maxent principle directs us to minimize the Lagrangian

$$\mathscr{L}(q) = \sum_T \left[\alpha_T\, (E(n_T) - N_T) + \beta_T\, (E(c_T) - C_T)\right] - S(q)$$

where the integers $\{N_T\}$ and real numbers $\{C_T\}$ are our prior information. In the usual way, the solution for each pair $(i \in \{1, \ldots, m'\}, j \ge 0)$ is


$$q_{i,j}(x_{i,j}) \propto e^{-E\left(\sum_{T'} \alpha_{T'} n_{T'} + \beta_{T'} c_{T'} \,\mid\, x_{i,j}\right)},$$

where the values of the Lagrange parameters are all set by our prior information.

This distribution is analogous to the one in the GCE. As usual, one can consider variants of it by focusing on one variable at a time, having prior knowledge in the form of rationality values, etc. In addition, even if we stay in this random-2-player games scenario, there is no reason for us to restrict attention to prior information paralleling that of the GCE. As with bounded rational game theory with a fixed number of players, our prior information can concern nonlinear functions of q, couple the cost functions, etc.

In particular, in evolutionary game theory we do not know the expected number of players having each type, nor their average costs. In addition, the equilibrium concept stipulates that all players will have type T if a particular condition holds. That condition is that the addition of a player of type other than T to the population results in an expected cost to that added player that is greater than the associated expected cost to the players having type T. This provides a model of the phenotypic interactions underlying natural selection.

We can encapsulate evolutionary game theory in a Lagrangian by appropriately replacing each pair of GCE-type constraints (one pair for each type) with a single constraint. As an example, we could have the (single) constraint for type T be that

$$E\left(\frac{n_T}{\sum_{T'} n_{T'}}\right) = E\left(\left[\frac{\max_{T'} c_{T'} - c_T}{\max_{T'}(c_{T'}) - \min_{T'}(c_{T'})}\right]^{\gamma}\right) \qquad (20)$$

for some positive real value $\gamma$. For finite $\gamma$, the entropy term in the Lagrangian ensures that for no T is the expectation value in the lefthand side of this constraint exactly 0.

In the limit of infinite $\gamma$, the distribution minimizing this Lagrangian is non-infinitesimal only for the evolutionarily stable strategies of conventional evolutionary game theory. These are the (type, strategy) pairs that are best performing, in the sense that no other pair has a lower cost function value. The distribution for finite $\gamma$ can be viewed as a “bounded rational” extension of conventional evolutionary game theory. In that extension (type, strategy) pairs are allowed even if they don’t have the lowest possible cost, so long as their cost is close to the lowest possible [62].

There is always a solution to this Lagrangian (unlike the case in conventional full rationality evolutionary game theory). The technique of Lagrange parameters provides that solution for each pair $(i \in \{1, \ldots, m'\}, j \ge 0)$ in the usual way:

$$q_{i,j}(x_{i,j}) \propto e^{-E\left(\sum_{T'} \lambda_{T'} f_{T'} \,\mid\, x_{i,j}\right)},$$

where the Lagrange parameters $\lambda_{T'}$ enforce our constraint, and

$$f_{T'}(x) \equiv \frac{n_{T'}}{\sum_{T''} n_{T''}} - \left[\frac{\max_{T''} c_{T''} - c_{T'}}{\max_{T''}(c_{T''}) - \min_{T''}(c_{T''})}\right]^{\gamma}.$$
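To get a feel for the role of $\gamma$ in the constraint of Eq. 20, the sketch below evaluates only the bracketed quantity on its righthand side, $[(\max_{T'} c_{T'} - c_T)/(\max_{T'} c_{T'} - \min_{T'} c_{T'})]^{\gamma}$, for a fixed and purely hypothetical list of per-type costs. It does not solve the Lagrangian; the function name and the numbers are assumptions for illustration, and the costs are assumed not to be all equal.

```python
def selection_weights(costs, gamma):
    """Bracketed term of Eq. 20 for each type, given fixed per-type costs."""
    hi, lo = max(costs), min(costs)
    return [((hi - c) / (hi - lo)) ** gamma for c in costs]

# Hypothetical costs for three types; the first type performs best.
costs = [1.0, 1.2, 3.0]

for gamma in (1, 10, 100):
    print(gamma, selection_weights(costs, gamma))
```

As $\gamma$ grows, the weight of the near-optimal second type shrinks toward zero while the lowest-cost type keeps weight 1, which mirrors the statement that only the best-performing pairs survive in the limit of infinite $\gamma$.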

More general forms of evolutionary game theory al- 
low games with more than two players, and localization 
via network structures delineating how players are likely 
to be grouped to play a game. Other elaborations have 
each player not know the exact attribute vectors of all its 
opponents, but only an “information structure” provid- 
ing some information about those opponents’ attribute 
vectors. All such extensions can be straightforwardly in- 
corporated into the current analysis. Many other exten- 
sions are simple to make as well. For example, since the 
cost functions have all components of n in their argument lists, they can depend on the total size of the population. This allows us to model the effect on population size of finite environmental resources.

Note that if we change how we encode the number of 
players of the various types and their joint meta-strategy 
in x, we change the form of the expectations in Eq. 20. 
This reflects the fact that by changing the encoding we 
change the implication of using a product distribution. 
Formally, such a change in the encoding is a change in 
the semi-coordinate system. See Sec. II L. 


IV. APPENDIX 

This appendix provides proofs absent from the main 
text. 


A. $\sum_i [E_p(g_i)]^2 - S(p)$ is convex over the unit 
simplex 

Proof: Since S(p) is concave over the unit simplex, and the unit simplex is a hyperplane, it suffices to prove that $\sum_i [E_p(g_i)]^2$ is convex over all of Euclidean space. Since a weighted average of convex functions is convex, we only need to prove that any single function of the form $[\int dx\, p(x) f(x)]^2$ is convex. The Hessian of this function is $2 f(x) f(x')$. Rotate coordinates so that f is a basis vector, i.e., so that f is proportional to a delta function. This doesn’t change the eigenvalues of the Hessian. After this change though, the Hessian is diagonal, with one non-zero entry on the diagonal, which is non-negative. So its eigenvalues are zero and a non-negative number. QED.


B. $R_{KL}$ is a rationality operator 

Proof: Since KL distance only equals 0 when its arguments match and is never negative, requirement (1) of rationality operators holds for $R_{KL}$. Next, since $R_{KL}(U, p) = \mathrm{argmin}_{\beta}\, [\beta \int dy\, p(y) U(y) + \ln(N(\beta U))]$, we know that $E_p(U) = -\frac{\partial \ln(N(\beta U))}{\partial \beta}\Big|_{\beta = R_{KL}(U,p)}$. Accordingly, all p with the same rationality have the same expected value $E_p(U)$. Using the technique of Lagrange parameters then readily establishes that of those distributions having the same expected U, the one with maximal entropy is a Boltzmann distribution. Furthermore, by requirement (1), we know that for a Boltzmann distribution the exponent $\beta$ must equal the rationality of that distribution. QED.
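The stationarity condition used in this proof says that $R_{KL}(U, p)$ is the $\beta$ at which the Boltzmann distribution has the same expected U as p. As a hedged sketch for a discrete space, that $\beta$ can be found by bisection, since the Boltzmann expectation is non-increasing in $\beta$. The function names, the search bracket, and the example utility `U` are assumptions made for illustration.

```python
import math

def boltzmann(U, beta):
    """Boltzmann distribution proportional to exp(-beta * U)."""
    w = [math.exp(-beta * u) for u in U]
    z = sum(w)
    return [wi / z for wi in w]

def expected(p, U):
    return sum(pi * ui for pi, ui in zip(p, U))

def rationality_kl(U, p, lo=-50.0, hi=50.0, tol=1e-12):
    """beta whose Boltzmann distribution matches E_p(U), found by bisection."""
    target = expected(p, U)
    # The Boltzmann mean of U is non-increasing in beta (its derivative is minus a variance).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected(boltzmann(U, mid), U) > target:
            lo = mid   # Boltzmann mean still too high: beta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

U = [0.0, 1.0, 3.0]
p1 = boltzmann(U, 1.7)              # a Boltzmann distribution with exponent 1.7
t = expected(p1, U)
p2 = [1.0 - t / 3.0, 0.0, t / 3.0]  # a different distribution with the same E(U)

r1 = rationality_kl(U, p1)
r2 = rationality_kl(U, p2)
print(r1, r2)
```

Both distributions are assigned the same rationality because they share the same expected U, and for the Boltzmann distribution that rationality coincides with its exponent.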


C. Alternative form of a constraint on $R_{KL}$

Proof: Let $f\{\alpha, u\}$ be any function that is monotonically decreasing in its (real-valued) first argument. Then any constraint $R([g_i]_{i,q}, q_i) - \rho_i = 0$ is satisfied iff the constraint $f\{R([g_i]_{i,q}, q_i), q_{(i)}\} - f\{\rho_i, q_{(i)}\} = 0$ is satisfied. Choose

$$f\{\alpha, q_{(i)}\} \equiv -\frac{\partial \ln(N(\beta [g_i]_{i,q}))}{\partial \beta}\Big|_{\beta = \alpha} = \int dx_i\, [g_i]_{i,q}(x_i)\, \frac{e^{-\alpha [g_i]_{i,q}(x_i)}}{N(\alpha [g_i]_{i,q})}.$$

Differentiating this quantity with respect to $\alpha$ gives the negative of the variance of $[g_i]_{i,q}$ under the Boltzmann distribution $\frac{e^{-\alpha [g_i]_{i,q}}}{N(\alpha [g_i]_{i,q})}$. Since variances are non-negative, this derivative is non-positive, which establishes that f is monotonically decreasing in its first argument.

Evaluating,

$$f\{\rho_i, q_{(i)}\} = \int dx\, g_i(x)\, \frac{e^{-\rho_i E(g_i \mid x_i)}}{N(\rho_i g_i)}\, q_{(i)}(x_{(i)}).$$

In addition, from the equation defining $R_{KL}$, we know that

$$-\frac{\partial \ln(N(\beta U(x_i)))}{\partial \beta}\Big|_{\beta = R_{KL}(U, q_i)} = \int dx_i\, q_i(x_i)\, U(x_i)$$

for any function U. Plugging in $U = [g_i]_{i,q}$, we see that

$$f\{R([g_i]_{i,q}, q_i), q_{(i)}\} = \int dx_i\, q_i(x_i)\, [g_i]_{i,q}(x_i) = E_q(g_i).$$

QED.


D. $q^g$ minimizes the Lagrangians of Eq. 10 

Proof: Following Nash, we can use Brouwer’s fixed point theorem to establish that for any non-negative $\{\rho_i\}$, there must exist at least one product distribution given by $q^g$. The constraint term in all the $\mathscr{L}_i$ of Eq. 10 is zero for this distribution. By requirement (2), we also know that given $q^g_{(i)}$, there is no $q_i$ with rationality $\rho_i$ that has higher entropy than $q^g_i$. Accordingly, no $q_i$ will have a lower value of $\mathscr{L}_i$. Since this holds for all i, $q^g$ minimizes all the Lagrangians in Eq. 10 simultaneously. QED.


E. Derivation of Lemma 1 

Proof: Consider the set of u such that the directional derivatives $D_u f_i$ evaluated at x' all equal 0. These are the directions consistent with our constraints to first order. We need to find the one of those u such that $D_u V$ evaluated at x' is maximal.

To simplify the analysis we introduce the constraint that |u| = 1. This means that the directional derivative $D_u V$ for any function V is just $u \cdot \nabla V$. We then use Lagrange parameters to solve our problem. Our constraints on u are $\sum_j u_j^2 = 1$ and $D_u f_i(x') = u \cdot \nabla f_i(x') = 0\ \forall i$. Our objective function is $D_u V(x') = u \cdot \nabla V(x')$.

Differentiating the Lagrangian gives

$$2\lambda_0 u + \sum_i \lambda_i \nabla f_i = \nabla V,$$

with solution

$$u = \frac{\nabla V - \sum_i \lambda_i \nabla f_i}{2\lambda_0}.$$

$\lambda_0$ enforces our constraint on |u|. Since we are only interested in specifying u up to a proportionality constant, we can set $2\lambda_0 = 1$. Redefining the Lagrange parameters by multiplying them by −1 then gives the result claimed.

QED.
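For a single linear constraint the direction derived here is easy to compute, which the following sketch illustrates. It assumes V is the maxent Lagrangian $\beta E_p(G) - S(p)$ on a three-point space and that the only constraint is normalization, so that $\nabla f = (1, \ldots, 1)$ and the Lagrange parameter is minus the mean of the gradient. The step size, the utility `G`, and the function names are illustrative assumptions; stepping against the resulting direction is plain gradient descent within the constraint surface.

```python
import math

def lagrangian(p, G, beta):
    """beta * E_p(G) - S(p) for a strictly positive discrete p."""
    return sum(pi * (beta * gi + math.log(pi)) for pi, gi in zip(p, G))

def constrained_direction(p, G, beta):
    """grad V + lambda * grad f, with lambda chosen so the result is orthogonal to grad f."""
    grad = [beta * gi + math.log(pi) + 1.0 for pi, gi in zip(p, G)]
    lam = -sum(grad) / len(grad)   # grad f = (1, ..., 1) for the normalization constraint
    return [g + lam for g in grad]

G = [0.0, 1.0, 2.0]
beta = 1.0
start = [1.0 / 3.0] * 3
p = list(start)
for _ in range(300):
    u = constrained_direction(p, G, beta)
    p = [pi - 0.1 * ui for pi, ui in zip(p, u)]  # move against u to reduce V

# The constrained minimizer of V is the Boltzmann distribution.
z = sum(math.exp(-beta * g) for g in G)
boltz = [math.exp(-beta * g) / z for g in G]
print(p, boltz)
```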


F. Proof of claims following Lemma 1 

i) Define $f_i(q) \equiv \int dx_i\, q_i(x_i)$, i.e., $f_i$ is the constraint forcing $q_i$ to be normalized. Now for any q that equals zero for some joint move there must be an i and an $x'_i$ such that $q_i(x'_i) = 0$. Plugging into Lemma 1, we can evaluate the component of the direction of steepest descent along the direction of player i’s probability of making move $x'_i$:

$$\frac{\partial \mathscr{L}_i}{\partial q_i(x'_i)} + \lambda \frac{\partial f_i}{\partial q_i(x'_i)} = \beta E(g_i \mid x'_i) + \ln(q_i(x'_i)) - \frac{\int dx''_i\, [\beta E(g_i \mid x''_i) + \ln(q_i(x''_i))]}{\int dx''_i\, 1}.$$

Since there must be some $x''_i$ such that $q_i(x''_i) \neq 0$, $\exists x_i$ such that $\beta E(g_i \mid x_i) + \ln(q_i(x_i))$ is finite. Therefore our component is negative infinite. So $\mathscr{L}_i$ can be reduced by increasing $q_i(x'_i)$. Accordingly, no q having zero probability for some joint move x can be a minimum of i’s Lagrangian.

ii) To construct a bounded rational game with multiple equilibria, note that at any (necessarily interior) local minimum q, for each i,

$$\beta E(g_i \mid x_i) + \ln(q_i(x_i)) = \beta \int dx_{(i)}\, g_i(x_i, x_{(i)}) \prod_{j \neq i} q_j(x_j) + \ln(q_i(x_i))$$

must be independent of $x_i$ by Lemma 1. So say there is a component-by-component bijection $T(x) \equiv (T_1(x_1), T_2(x_2), \ldots)$ that leaves all the $\{g_j\}$ unchanged, i.e., such that $g_j(x) = g_j(T(x))\ \forall x, j$ [63].

Define q' by $q'(x) = q(T(x))\ \forall x$. Then for any two values $x_i^1$ and $x_i^2$,

$$\begin{aligned}
&\beta E_{q'}(g_i \mid x_i^1) + \ln(q'_i(x_i^1)) - \left[\beta E_{q'}(g_i \mid x_i^2) + \ln(q'_i(x_i^2))\right] \\
&\quad = \beta \int dx_{(i)}\, g_i(x_i^1, x_{(i)}) \prod_{j \neq i} q_j(T(x_j)) + \ln(q_i(T(x_i^1))) - \left[\beta \int dx_{(i)}\, g_i(x_i^2, x_{(i)}) \prod_{j \neq i} q_j(T(x_j)) + \ln(q_i(T(x_i^2)))\right] \\
&\quad = \beta \int dx_{(i)}\, g_i(x_i^1, T^{-1}(x_{(i)})) \prod_{j \neq i} q_j(x_j) + \ln(q_i(T(x_i^1))) - \left[\beta \int dx_{(i)}\, g_i(x_i^2, T^{-1}(x_{(i)})) \prod_{j \neq i} q_j(x_j) + \ln(q_i(T(x_i^2)))\right] \\
&\quad = \beta \int dx_{(i)}\, g_i(T(x_i^1), x_{(i)}) \prod_{j \neq i} q_j(x_j) + \ln(q_i(T(x_i^1))) - \left[\beta \int dx_{(i)}\, g_i(T(x_i^2), x_{(i)}) \prod_{j \neq i} q_j(x_j) + \ln(q_i(T(x_i^2)))\right] \\
&\quad = \beta E_q(g_i \mid T(x_i^1)) + \ln(q_i(T(x_i^1))) - \left[\beta E_q(g_i \mid T(x_i^2)) + \ln(q_i(T(x_i^2)))\right]
\end{aligned}$$

where the invariance of $g_i$ was used in the penultimate step. Since q is a local minimum though, this last difference must equal 0. Therefore q' is also a local minimum.

Now choose the game so that $\forall i, x_i$, $T(x_i) \neq x_i$. (Our congestion game example has this property.) Then the only way the transformation $q \rightarrow q(T)$ can avoid producing a new product distribution is if $q_i(x_i) = q_i(x'_i)\ \forall i, x_i, x'_i$, i.e., q is uniform. Say the Hessians of the players’ Lagrangians are not all positive definite at the uniform q. (For example have our congestion game be biased away from uniform multiplicities.) Then that q is not a local minimum of the Lagrangians. Therefore at a local minimum, $q \neq q(T)$. Accordingly, q and q(T) are two distinct equilibria.

iii) To establish that at any q there is always a direction along which any player’s Lagrangian is locally convex, fix all but two of the $\{q_i\}$, $q_0$ and $q_1$, and fix both $q_0$ and $q_1$ for all but two of their respective possible values, which we can write as $q_0(0)$, $q_0(1)$, $q_1(0)$, and $q_1(1)$, respectively. So we can parameterize the set of q we’re considering by two real numbers, $x \equiv q_0(0)$ and $y \equiv q_1(0)$. The 2x2 Hessian of $\mathscr{L}_i$ as a function of x and y has the entries

$$\begin{pmatrix} \frac{1}{x} + \frac{1}{a - x} & \alpha \\ \alpha & \frac{1}{y} + \frac{1}{b - y} \end{pmatrix}$$

where $a \equiv 1 - q_0(0) - q_0(1)$ and $b \equiv 1 - q_1(0) - q_1(1)$, and $\alpha$ is a function of $g_i$ and $\prod_{j \neq 0,1} q_j$. Defining $s \equiv \frac{1}{x} + \frac{1}{a - x}$ and $t \equiv \frac{1}{y} + \frac{1}{b - y}$, the eigenvalues of that Hessian are

$$\frac{s + t \pm \sqrt{4\alpha^2 + (s - t)^2}}{2}.$$

The eigenvalue for the positive root is necessarily positive. Therefore along the corresponding eigenvector, $\mathscr{L}_i$ is convex at q. QED.

iv) There are several ways to show that the value of $E(g_i)$ must shrink as $\beta_i$ grows. Here we do so by evaluating the associated derivative with respect to $\beta_i$. Define $N(U) \equiv \int dy\, e^{-U(y)}$, the normalization constant for the distribution proportional to $e^{-U(y)}$. View the $x_i$-indexed vector $q_i^g$ as a function of $\beta_i$, $g_i$ and $q_{(i)}$. So we can somewhat inelegantly write $E(g_i) = -\frac{\partial \ln(N(\beta_i [g_i]_{i,q_{(i)}}))}{\partial \beta_i}$. Then one can expand

$$\frac{\partial E(g_i)}{\partial \beta_i} = -\frac{\partial^2 \ln(N(\beta_i [g_i]_{i,q_{(i)}}))}{\partial \beta_i^2} = -\mathrm{Var}([g_i]_{i,q_{(i)}})$$

where the variance is over possible $x_i$, sampled according to $q_i^g(x_i)$. QED.

game theory and the axiomatization of utility, here we 
assume players are interested in expected costs (negative 
utilities), not variances in those costs. 
ln(π(x)); the issues involved in approximating the Boltz- 
approximating distributions. 

[61] Note that it is trivial to replace meta-strategies with 
strategies throughout the analysis below: simply restrict 
[62] Many other parameterized constraints will result in this 
[63] As an example, consider a congestion team game in which 